Submission audit — claim status ledger

How to read this page. This is a generated audit ledger for reviewers who want to trace claims back to decision records. It is not the main narrative. Start with README.md, RESULTS.md, or WRITEUP.md for the project story.

Generated from decisions/ADR-*.md frontmatter. Do not edit this file directly; update the ADR frontmatter or the generator, then run python scripts/regenerate_audit.py.
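As a rough illustration of the generation step, the sketch below turns one ADR's frontmatter into a claim-index row. It is not the contents of scripts/regenerate_audit.py: the frontmatter keys (adr, status, closing_commit) and both function names are hypothetical, chosen only to mirror the columns of the claim index.

```python
def parse_frontmatter(text: str) -> dict[str, str]:
    """Read simple `key: value` pairs between the leading '---' fences."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def ledger_row(fields: dict[str, str]) -> str:
    """Format one claim-index row; hypothetical keys: adr, status, closing_commit."""
    number = int(fields["adr"])
    parts = [
        f"CLAIM-{number:03d}",
        f"ADR-{number:03d}",
        fields["status"],
        fields.get("closing_commit", ""),
    ]
    # An ADR with no closing commit yet yields a three-column row.
    return " ".join(part for part in parts if part)
```

A row with no closing value, as in the last entries of the index, would simply come from an ADR whose frontmatter omits that key.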

Claim index

Claim ID ADR Status Closing commit/ADR
CLAIM-001 ADR-001 Accepted 2a7b123
CLAIM-002 ADR-002 Superseded 2a7b123
CLAIM-003 ADR-003 Accepted 2a7b123
CLAIM-004 ADR-004 Superseded 2a7b123
CLAIM-005 ADR-005 Accepted 2a7b123
CLAIM-006 ADR-006 Accepted e760faf
CLAIM-007 ADR-007 Superseded e760faf
CLAIM-008 ADR-008 Accepted e760faf
CLAIM-009 ADR-009 Accepted e760faf
CLAIM-010 ADR-010 Accepted e760faf
CLAIM-011 ADR-011 Accepted e760faf
CLAIM-012 ADR-012 Accepted e760faf
CLAIM-013 ADR-013 Accepted e760faf
CLAIM-014 ADR-014 Accepted 727767c
CLAIM-015 ADR-015 Accepted 727767c
CLAIM-016 ADR-016 Accepted 8da7018
CLAIM-017 ADR-017 Accepted cfa7559
CLAIM-018 ADR-018 Accepted cfa7559
CLAIM-019 ADR-019 Accepted cfa7559
CLAIM-020 ADR-020 Accepted cfa7559
CLAIM-021 ADR-021 Accepted b750d1d
CLAIM-022 ADR-022 Accepted b750d1d
CLAIM-023 ADR-023 Accepted b750d1d
CLAIM-024 ADR-024 Accepted b750d1d
CLAIM-025 ADR-025 Accepted e335739
CLAIM-026 ADR-026 Accepted fa1ad33
CLAIM-027 ADR-027 Accepted fa1ad33
CLAIM-028 ADR-028 Accepted fa1ad33
CLAIM-029 ADR-029 Accepted fa1ad33
CLAIM-030 ADR-030 Accepted 7979dc9
CLAIM-031 ADR-031 Accepted 7979dc9
CLAIM-032 ADR-032 Accepted 7979dc9
CLAIM-033 ADR-033 Accepted 7979dc9
CLAIM-034 ADR-034 Accepted 7979dc9
CLAIM-035 ADR-035 Accepted 5427b95
CLAIM-036 ADR-036 Accepted 5427b95
CLAIM-037 ADR-037 Accepted 5427b95
CLAIM-038 ADR-038 Accepted 5427b95
CLAIM-039 ADR-039 Accepted 5427b95
CLAIM-040 ADR-040 Accepted cc7ed9f
CLAIM-041 ADR-041 Accepted ecfa2b6
CLAIM-042 ADR-042 Accepted df72b01
CLAIM-043 ADR-043 Accepted e4454c0
CLAIM-044 ADR-044 Accepted 8c053b0
CLAIM-045 ADR-045 Accepted c406f58
CLAIM-046 ADR-046 Accepted 70e34fd
CLAIM-047 ADR-047 Accepted ab8a501
CLAIM-048 ADR-048 Accepted 0fad4e1
CLAIM-049 ADR-049 Accepted 423c2c8
CLAIM-050 ADR-050 Accepted 3b16036
CLAIM-051 ADR-051 Accepted v1.0.9 (Block A; Block B carryforward to v1.1.x)
CLAIM-052 ADR-052 Accepted v1.0.3
CLAIM-053 ADR-053 Accepted v1.0.4
CLAIM-054 ADR-054 Accepted v1.0.5
CLAIM-055 ADR-055 Accepted v1.0.8
CLAIM-056 ADR-056 Accepted v1.0.8
CLAIM-057 ADR-057 Accepted v1.0.8
CLAIM-058 ADR-058 Accepted v1.0.9
CLAIM-059 ADR-059 Accepted v1.1.0
CLAIM-060 ADR-060 Accepted v1.1.0 (methodology); v1.1.1 (execution carryforward)
CLAIM-061 ADR-061 Accepted v1.1.1
CLAIM-062 ADR-062 Accepted pending implementation commit
CLAIM-063 ADR-063 Accepted v1.1.2
CLAIM-064 ADR-064 Accepted v1.2.0
CLAIM-065 ADR-065 Accepted v1.2.1
CLAIM-066 ADR-066 Accepted v1.2.2
CLAIM-067 ADR-067 Accepted v1.2.2
CLAIM-068 ADR-068 Accepted v1.2.6
CLAIM-069 ADR-069 Accepted v1.2.6
CLAIM-070 ADR-070 Accepted v1.2.8
CLAIM-071 ADR-071 Accepted 37c2b32
CLAIM-072 ADR-072 Accepted 8105f37
CLAIM-073 ADR-073 Accepted ba342c7
CLAIM-074 ADR-074 Accepted 14f0c05
CLAIM-075 ADR-075 Accepted 428971c
CLAIM-076 ADR-076 Accepted
CLAIM-077 ADR-077 Accepted
CLAIM-078 ADR-078 Accepted
CLAIM-079 ADR-079 Accepted
CLAIM-080 ADR-080 Accepted
CLAIM-081 ADR-081 Accepted

Claim details

CLAIM-001 - ADR-001: Submission deadline 2026-05-18 morning with infra-leveraged Long-scope attempt and fallback ladder

Status: Accepted

Source: SPEC_GREENFIELD.md §Brief row 304 (Submission deadline / time budget)

Closing commit/ADR: 2a7b123

Claim

The submission targets 2026-05-18 morning (~2.5 working days from 2026-05-15); scope ambition is Long-bucket (2×3 trained-rung grid + multi-seed + full OOD slate + paired-bootstrap) leveraging runpod-deploy + eval-toolkit infrastructure; fallback ladder shrinks rung count before sacrificing methodology integrity.

Acceptance criterion

All Phase 1-5 work is completed by 2026-05-18 morning; or the fallback ladder activates and the writeup honestly reports the realized rung set without presenting unrealized rungs as completed.

CLAIM-002 - ADR-002: Deliverable format — focused PDF plus public GitHub repo as evidence locker

Status: Superseded

Source: SPEC_GREENFIELD.md §Brief row 305 (Deliverable format)

Closing commit/ADR: 2a7b123

Claim

The submission deliverable is a focused PDF (rendered from WRITEUP.md + appendices) paired with a public GitHub repo containing ADRs, code, notebooks, training manifests, per-row predictions, and topic-focused markdown spokes for any reader who wants depth beyond the PDF.

Acceptance criterion

At submission time, both artifacts exist and are mutually consistent — PDF cross-links to the repo at a stable git tag; repo contents reproduce the headline numbers in the PDF.

CLAIM-003 - ADR-003: Repository visibility — public from start (kit default ratified)

Status: Accepted

Source: SPEC_GREENFIELD.md §Brief row 306 (Repo visibility) + §Kit-Ratify row 367

Closing commit/ADR: 2a7b123

Claim

The submission GitHub repository is public from project start; the kit-level default (per .gitignore + SPEC_STRATEGY.md) is ratified without override. Transcripts remain gitignored; the brief itself is never committed.

Acceptance criterion

Repo is public on GitHub at submission time; transcripts/ remains gitignored except README.md; no brief contents committed.

CLAIM-004 - ADR-004: Reviewer profile A1+A2 with hub-and-spoke writeup structure

Status: Superseded

Source: SPEC_GREENFIELD.md §Brief row 307 (Reviewer profile + expected reading time)

Closing commit/ADR: 2a7b123

Claim

The submission is read by a dual audience — hiring manager (A1) and ML researcher (A2) — with no single reading-time bound. The PDF is the focused hub (≈ 10-15 pages, layered for both audiences); per-topic markdown spokes under WRITEUP/ carry depth that the PDF cross-links to.

Acceptance criterion

PDF has layered structure (exec summary + headlines for A1; methodology narrative + ADR-indexed appendix for A2); every section that could bloat emits a single inline link to a repo spoke; spokes are standalone-readable.

CLAIM-005 - ADR-005: Project-level methodology principles — methodology over metrics, honest evaluation preferred, structured limitations

Status: Accepted

Source: Phase 0-00 brief alignment conversation (Q4 Signals 2-3, Q7 add-on)

Closing commit/ADR: 2a7b123

Claim

The submission is governed by three project-level methodology principles surfaced during Phase 0-00 brief alignment — (1) methodology over metrics, (2) honest evaluation preferred even when models look worse, (3) structured limitations with extension conditions. These principles propagate to ADR rationale, WRITEUP tone, and downstream methodology decisions.

Acceptance criterion

Every methodology component has an ADR whose Consequences section names its extension condition; every scope-bound in the writeup is paired with the limitation it imposes and the conditions under which it would or wouldn’t make sense to extend it; reported metrics include cases where the model looks worse under honest eval, named as such; comparisons are interpreted via effect-size + uncertainty, not via dichotomous test outcomes.

CLAIM-006 - ADR-006: Headline metrics, operating-point pinpoints, and statistical apparatus

Status: Accepted

Source: SPEC_GREENFIELD.md §Brief row 308 (Brief-mandated metrics / constraints) + §3 Eval rows 340-347

Closing commit/ADR: e760faf

Claim

The submission reports four headline metrics (AUPRC, AUROC, Recall@FPR=1%, ECE) at three operating-point pinpoints {0.1%, 1%, 5%} per rung; statistical inference is via 95% bootstrap CIs (10K iters, BCa for marginals, paired for rung-vs-rung differences) with MDE on every reported CI; multi-seed protocol is 3 seeds paired across rungs (adaptive escalation to 5 if budget permits per ADR-001 fallback ladder); no formal p-tests (estimation-over-testing, per ADR-005 Principle 2); cost-weighted thresholding is rejected as false precision and replaced by qualitative scenario discussion in WRITEUP/threshold-policy.md.

Acceptance criterion

All headline rung-tables show the four metrics with three Recall@FPR pinpoints; every reported metric has a bootstrap CI; rung-vs-rung comparisons are reported as Δ-CIs (paired bootstrap), not p-values; MDE is computed for every CI; cost-weighted thresholding is absent from the writeup and replaced by ≥3 deployment scenarios in the threshold-policy spoke.
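The paired-bootstrap Δ-CI named in this claim can be sketched as follows. This is a minimal sketch under stated assumptions, not the project's eval-toolkit code: it uses a plain percentile interval (the claim reserves BCa for marginal CIs), AUROC as the example metric, and hypothetical function names. Pairing means both rungs are scored on the same resampled rows in every iteration.

```python
import numpy as np


def auroc(y_true, scores) -> float:
    """AUROC via the rank-sum identity; tied scores receive average ranks."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    _, inverse, counts = np.unique(scores, return_inverse=True, return_counts=True)
    cum = np.cumsum(counts)
    ranks = (cum - (counts - 1) / 2.0)[inverse]
    pos = y_true == 1
    n_pos = int(pos.sum())
    n_neg = len(y_true) - n_pos
    u_stat = ranks[pos].sum() - n_pos * (n_pos + 1) / 2.0
    return float(u_stat / (n_pos * n_neg))


def paired_bootstrap_delta_ci(y_true, scores_a, scores_b,
                              n_boot=10_000, alpha=0.05, seed=0):
    """Return (point, lo, hi) for AUROC(b) - AUROC(a) on shared resamples."""
    y_true = np.asarray(y_true)
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(y_true)
    deltas = []
    while len(deltas) < n_boot:
        idx = rng.integers(0, n, size=n)  # same row indices for both rungs
        y = y_true[idx]
        if y.min() == y.max():
            continue  # AUROC is undefined without both classes
        deltas.append(auroc(y, scores_b[idx]) - auroc(y, scores_a[idx]))
    lo, hi = np.quantile(deltas, [alpha / 2, 1 - alpha / 2])
    point = auroc(y_true, scores_b) - auroc(y_true, scores_a)
    return point, float(lo), float(hi)
```

Reporting the interval on the difference, instead of comparing two marginal intervals, is what lets the rung-vs-rung comparison be read as an effect size with uncertainty.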

CLAIM-007 - ADR-007: Methodology rung architecture — trained backbones, LLM-judge reference rungs, Cohen’s kappa

Status: Superseded

Source: SPEC_GREENFIELD.md §Brief row 308 (Q5-C4) + §2 Model rows 330-338

Closing commit/ADR: e760faf

Claim

The rung slate comprises six trained rungs (DeBERTa-v3 + ModernBERT × {frozen-probe, LoRA, full-FT}) plus two LLM-judge reference rungs (one OpenAI model, one Anthropic model — specific model IDs finalized in Phase 0-03), plus optional existing-classifier baselines (Lakera Guard, ProtectAI LLM-Guard) as reference scorers. Cohen’s kappa is computed pairwise across all rungs with bootstrap CIs on each kappa. Library stack — HuggingFace Transformers + PEFT + sentence-transformers; dependencies pinned via uv.lock.

Acceptance criterion

At Phase 3 close, every rung in the slate has been trained (or inference-scored for reference rungs) on the locked train/eval splits; per-row predictions are persisted for every rung × seed × fold; the pairwise kappa matrix is rendered as a heatmap in WRITEUP/reference-scorer-audit.md; LLM-judge calls are reproducible (temp=0; prompt template versioned in repo).
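For readers tracing the kappa matrix, a minimal sketch of pairwise Cohen's kappa over rung decisions follows. It assumes each rung's scores have already been thresholded into binary labels on the same rows; the function names and the dictionary layout are hypothetical, and bootstrap CIs on each kappa are left out for brevity.

```python
from itertools import combinations

import numpy as np


def cohen_kappa(a, b) -> float:
    """Chance-corrected agreement between two label vectors on the same rows."""
    a = np.asarray(a)
    b = np.asarray(b)
    labels = np.union1d(a, b)
    p_observed = float(np.mean(a == b))
    p_expected = float(sum(np.mean(a == lab) * np.mean(b == lab) for lab in labels))
    if p_expected == 1.0:
        return 1.0  # both raters emit the same single label everywhere
    return (p_observed - p_expected) / (1.0 - p_expected)


def pairwise_kappa(decisions: dict[str, list[int]]) -> dict[tuple[str, str], float]:
    """Kappa for every unordered pair of rungs, keyed by (rung_a, rung_b)."""
    return {
        (r1, r2): cohen_kappa(decisions[r1], decisions[r2])
        for r1, r2 in combinations(sorted(decisions), 2)
    }
```

The resulting dictionary has one entry per rung pair, which is the shape a heatmap renderer would need.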

CLAIM-008 - ADR-008: Data scope — public-only sources, hybrid splits, NotInject inclusion (brief-level lock)

Status: Accepted

Source: SPEC_GREENFIELD.md §Brief row 308 (Q5-C5) + §1 Data rows 323-329

Closing commit/ADR: e760faf

Claim

Training and evaluation data come from public sources only (HuggingFace-hosted; dataset revisions pinned by SHA at lock time); splits are hybrid (source-disjoint Leave-One-Dataset-Out for OOD slate + random k-fold within in-distribution sources); licenses are mixed and documented per source (full audit deliverable in Phase 0-02); NotInject-style benign-trigger hard negatives are included in the OOD slate to test over-defense per the InjecGuard 2024 methodology.

Acceptance criterion

Phase 0-02 produces a full source-slate manifest (per-source row counts, license, role, HF revision SHA); Phase 1 leakage scan confirms no train-eval overlap above thresholds documented in evals/leakage_report.json; the OOD slate includes a NotInject-equivalent benign-trigger slice.
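The source-disjoint Leave-One-Dataset-Out structure can be illustrated with a short sketch. The row layout (a dict with a "source" key) and the function name are assumptions for illustration; the project's actual split code, and the random k-fold used within in-distribution sources, are not shown.

```python
def lodo_folds(rows):
    """Yield (held_out_source, train_rows, eval_rows), one fold per source.

    Each fold evaluates on exactly one source and trains on all the others,
    so no source contributes rows to both sides of a fold.
    """
    sources = sorted({row["source"] for row in rows})
    for held_out in sources:
        train = [row for row in rows if row["source"] != held_out]
        evaluation = [row for row in rows if row["source"] == held_out]
        yield held_out, train, evaluation
```

Because the held-out source never appears in training, each fold's eval metrics measure transfer to an unseen dataset and not memorization of a source's style.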

CLAIM-009 - ADR-009: Process mandates — two-tier reproducibility, marker-based testing, compute disclosure

Status: Accepted

Source: SPEC_GREENFIELD.md §Brief row 308 (Q5-C6) + §STYLE rows 350-351 + §Submission row 355

Closing commit/ADR: e760faf

Claim

Reproducibility is two-tier (laptop-only smoke + GPU-rental canonical, per SPEC_GREENFIELD row 355); testing discipline is marker-based (unit / smoke / integration / golden) with a 70% soft coverage floor on core modules; pre-commit hooks enforce discipline locally; no remote CI is set up (tight calendar per ADR-001); notebooks are jupytext-paired with an illustrative-only role (per ADR-013); GPU-hours + cost per rung are disclosed in the WRITEUP.

Acceptance criterion

A laptop-only smoke target completes the full pipeline (data load → minimal training → eval → analysis) on a developer laptop without RunPod access; canonical reproduction instructions are documented for GPU rental; tests carry markers selectable via pytest -m; per-rung GPU-hours + dollar cost are reported in WRITEUP.

CLAIM-010 - ADR-010: Scope limits with per-bound extension conditions

Status: Accepted

Source: SPEC_GREENFIELD.md §Brief row 308 (Q5-C7) + §0 Threat rows 319-322 + §2 Model row 330 + §2 Model row 331

Closing commit/ADR: e760faf

Claim

Submission scope is bounded along six axes — language (English-only), attack class (direct + indirect), input length (per-backbone cap, 512 for DeBERTa-v3 and 8K for ModernBERT), model size (sub-1B parameters), deployment surface (text-only classifier + LLM-judge reference rungs), adversarial-strength budget (static attacks only). Each bound is paired with the explicit conditions under which extending past it would or wouldn’t make sense, per ADR-005 Principle 3 (structured limitations with extension conditions). Decision-relevant extension framings replace generic “future work” boilerplate.

Acceptance criterion

WRITEUP/limitations-and-future-work.md spoke enumerates all six scope bounds with their extension conditions; every ADR for a methodology component (per CLAUDE.md anti-pattern) names the bound it operates within.

CLAIM-011 - ADR-011: Methodology guarantees — eight banned approaches surfaced as project commitments

Status: Accepted

Source: SPEC_GREENFIELD.md §Brief row 308 (Q5-C8) + CLAUDE.md anti-patterns

Closing commit/ADR: e760faf

Claim

Eight methodology guarantees are committed and surfaced in the writeup (1-paragraph summary in PDF + WRITEUP/methodology-guarantees.md spoke), namely — (1) no tuning on test data; (2) no train-eval overlap (leakage scan + cross-source benign dedup); (3) no closed-source datasets; (4) no hand-labeling without inter-rater agreement (audited via Cohen’s kappa, per ADR-007); (5) no cherry-picking seeds (seed-aggregate + per-seed transparency); (6) no adaptive threshold selection on test data (thresholds on validation only); (7) no data leakage train→eval (Phase 5 pre-submission grep suite); (8) no untracked methodology components (every component has an ADR).

Acceptance criterion

WRITEUP/methodology-guarantees.md spoke enumerates all eight guarantees with the verification mechanism for each; the PDF exec summary or methodology narrative cites the spoke; the pre-Phase-5 grep suite runs cleanly with no leakage detected.

CLAIM-012 - ADR-012: Soft-signals naming discipline and external artifact engagement set

Status: Accepted

Source: SPEC_GREENFIELD.md §Brief row 308 (Q5-C9 + Q5-C10)

Closing commit/ADR: e760faf

Claim

Eight soft signals are explicitly named in the WRITEUP — calibration, OOD honesty, reproducibility, writing clarity, engineering taste, methodology over results, time-budgeted craftsmanship, honesty about limitations — each aligned by prior ADR locks and cited at the relevant section as the reason a particular methodology choice was made. The default external artifact engagement set covers — Lakera Guard / ProtectAI as reference scorer rungs (per ADR-007); JailbreakBench (Chao 2024), HarmBench (Mazeika 2024), InjecAgent (Zhan 2024) cited and acknowledged; NotInject (Li & Liu 2024) replicated via OOD inclusion; BIPIA (Yi 2023) compared against; PromptShield (Microsoft 2024) acknowledged as Recall@FPR-pinpoint influence; OWASP LLM Top 10 cited if industry-standard threat-model framing is relevant.

Acceptance criterion

Every soft signal listed appears explicitly in the WRITEUP at the relevant methodology section (“the brief emphasizes X, so we…”); every external artifact in the engagement set is either cited in the WRITEUP references or implemented as a rung / OOD slice as the engagement level mandates.

CLAIM-013 - ADR-013: Kit-ratify bulk — Phase 0 strictness, brief-intake protocol, notebook role and RunPod persistence

Status: Accepted

Source: SPEC_GREENFIELD.md §Kit-Ratify rows 365-368 + Q9 walk surfacing notebook-role and RunPod persistence

Closing commit/ADR: e760faf

Claim

Phase 0 strictness ratifies the kit default (all [OPEN] rows resolved before Phase 1); brief-intake protocol is live Phase 0-00 sub-session (kit default; transcript captured for ADR linkage); repo visibility is public from start (re-affirms ADR-003 from the kit-ratify frame); notebooks are jupytext-paired (kit default) with explicit illustrative-only role — GPU training runs are Python scripts, not notebook cells, and all RunPod-generated artifacts (per-row predictions, training manifests, checkpoints, logs, results JSON) are persisted to durable storage (local + HF Hub or S3-equivalent) before any pod is torn down.

Acceptance criterion

Phase 1 cannot start until every [OPEN] ledger row is locked or explicitly deferred; the Phase 0-00 transcript references this ADR; notebooks in this repo contain no GPU-bound training code; before any RunPod pod is torn down, a pre-teardown persistence checklist is verified (per-row predictions present at evals/predictions/, training manifests present at training/runs/, eval results at evals/results.json, checkpoints pushed to HF Hub or downloaded locally).
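The pre-teardown persistence checklist lends itself to a small script. The sketch below checks only the three local paths named in the criterion; the function names are hypothetical, and confirming that checkpoints reached HF Hub is deliberately left out because it needs a network call that this sketch does not assume.

```python
from pathlib import Path


def _has_files(directory: Path) -> bool:
    """True when the directory exists and contains at least one file."""
    return directory.is_dir() and any(p.is_file() for p in directory.rglob("*"))


def missing_artifacts(repo_root) -> list[str]:
    """Return the checklist entries that are not yet persisted locally."""
    root = Path(repo_root)
    checks = {
        "evals/predictions/": _has_files(root / "evals" / "predictions"),
        "training/runs/": _has_files(root / "training" / "runs"),
        "evals/results.json": (root / "evals" / "results.json").is_file(),
    }
    return [name for name, present in checks.items() if not present]
```

An empty return value would be the signal that the local half of the checklist passes; any listed entry is a reason to keep the pod alive.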

CLAIM-014 - ADR-014: Threat-model bundle — attack-class scope, language, length cap, truncation policy

Status: Accepted

Source: SPEC_GREENFIELD.md §0 Threat rows 319-322 + Phase 0-01 walk

Closing commit/ADR: 727767c

Claim

Phase 0-01 locks four §0 Threat decisions as a single bundle — (Q1) attack-class scope is direct injection as primary trained scope plus indirect injection as zero-shot OOD probe via BIPIA (no labeled indirect training data exists in the dossier-vetted slate; the asymmetry is intentional and surfaced as a methodology finding rather than papered over); (Q2) language scope is English-only (reaffirms ADR-010 Bound 1; every dossier-vetted eval slice is English; multilingual extension would require both data and researcher-auditability that this submission lacks); (Q3) length cap is per-backbone native (refines ADR-010 Bound 3; under ADR-015’s single-backbone refinement this resolves to ModernBERT 8192 for the trained rung, with reference rungs running at their published native caps — ProtectAI deberta-v3 at 512, Lakera as-API, LLM-judges full-sample); (Q4) truncation policy is adaptive chunked scoring with max-pool aggregation and stride equal to cap divided by two at evaluation time, plus head-truncation at training time (HF default since training-positives are short and the cap rarely bites at train time); a chunked-vs-head ablation on the BIPIA slice is mandatory and lives in WRITEUP/truncation-ablation.md as a methodology-quantification artifact rather than a hidden detail.

Acceptance criterion

SPEC_GREENFIELD ledger rows 319/320/321/322 carry locked-to-X (see ADR-014) status; the SPEC_SHEET §3.3 truncation slot carries the bracketed LOCKED marker naming adaptive-chunked-max-pool (per ADR-014); tests/test_invariants.py contains skip-marked stubs test_trained_backbone_modernbert_only_invariant and test_truncation_policy_adaptive_chunked_max_pool; the Phase 1 length-histogram audit produces evals/length_histograms.{train,ood}.json with per-slice quantiles on the ModernBERT-base tokenizer; if BIPIA samples with token-length above 8192 exceed 15 percent of the slice, a superseding ADR-016 adjusts the chunk-stride or aggregation policy; the WRITEUP/truncation-ablation.md spoke reports delta-AUROC and delta-Recall-at-FPR-1-percent between adaptive-chunked and head-truncation on the BIPIA slice.
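The evaluation-time truncation policy (overlapping chunks at stride cap/2, max-pooled) can be sketched as below. This is a simplified illustration: score_fn stands in for whatever model call produces an injection score for a token window, and both function names are hypothetical. Special tokens and batching are ignored.

```python
def chunk_spans(n_tokens: int, cap: int) -> list[tuple[int, int]]:
    """Half-overlapping [start, end) windows of at most `cap` tokens."""
    if cap < 2:
        raise ValueError("cap must be at least 2 so the stride is positive")
    stride = cap // 2
    if n_tokens <= cap:
        return [(0, n_tokens)]  # short inputs are scored once, untruncated
    spans = []
    start = 0
    while True:
        end = min(start + cap, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break
        start += stride
    return spans


def score_chunked(token_ids, cap: int, score_fn) -> float:
    """Max-pool the per-chunk scores: one suspicious window flags the sample."""
    return max(score_fn(token_ids[s:e]) for s, e in chunk_spans(len(token_ids), cap))


def score_head_truncated(token_ids, cap: int, score_fn) -> float:
    """Baseline for the ablation: score only the first `cap` tokens."""
    return score_fn(token_ids[:cap])
```

The two scoring functions differ only on inputs longer than the cap, which is why the ablation is run on the slice where long samples occur.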

CLAIM-015 - ADR-015: Rung architecture refinement — ModernBERT-base only trained slate, supersedes ADR-007

Status: Accepted

Source: SPEC_GREENFIELD.md §2 Model row 330 (Backbone choice) + ADR-014 Q3/Q4 walk

Closing commit/ADR: 727767c

Claim

The trained-rung slate is narrowed from the original two-backbones-by-three-conditions matrix (six trained rungs, DeBERTa-v3 plus ModernBERT each at frozen-probe, LoRA, full-FT) to a single-backbone-by-three-conditions slate (three trained rungs, ModernBERT-base at frozen-probe, LoRA, full-FT) because the per-backbone native context-window asymmetry from ADR-014 Q3 (DeBERTa-v3 capped at 512 tokens vs ModernBERT capped at 8192) produces an irreducible truncation confound on the indirect-injection zero-shot OOD slice — the cross-backbone delta-AUROC on BIPIA would be partly architecture and partly chunked-vs-native truncation aggregation, with no decomposition possible inside the project’s compute and calendar budget. Reference rungs from ADR-007 are preserved unchanged — OpenAI LLM-judge plus Anthropic LLM-judge plus Lakera Guard plus ProtectAI deberta-v3-base-prompt-injection — each called at its published native configuration including its native truncation policy. Multi-seed protocol at the ADR-006 floor of three seeds yields three trained rungs times three seeds times five LODO folds equals 45 training runs per evaluation. The fallback ladder is updated to 1x3 then 1x2 then 1x1; the original 2x3 then 2x2 step is no longer applicable. Backbone choice is hard-locked to ModernBERT-base; a silent fallback to DeBERTa-v3-base is explicitly prohibited — any catastrophic operational issue (checkpoint inaccessible, library compatibility break under uv pin resolution) requires a superseding ADR rather than a silent swap.

Acceptance criterion

ADR-007 carries status Superseded with superseded_by 015; SPEC_GREENFIELD ledger row 330 reads locked-to-ModernBERT-base (see ADR-015); SPEC_SHEET context paragraph and §4 model recipe reflect the 1x3 trained slate with three trained rungs plus four reference rungs; assumptions.md A-002 budget revised; tests/test_invariants.py has skip-marked stub test_trained_backbone_modernbert_only_invariant asserting the trained-rung slate contains exactly ModernBERT-base across three conditions; the Phase 2 training pipeline produces three trained rungs times three seeds times five LODO folds equals 45 per-row prediction parquet files at evals/predictions/.
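The run count in this claim is a plain Cartesian product, which the sketch below makes explicit. The rung labels and seed values are placeholders, not the project's actual identifiers; only the counts (three rungs, three seeds, five folds) come from the claim.

```python
from itertools import product


def run_grid(rungs, seeds, n_folds):
    """Enumerate every (rung, seed, fold) training run."""
    return [
        {"rung": rung, "seed": seed, "fold": fold}
        for rung, seed, fold in product(rungs, seeds, range(n_folds))
    ]


# Placeholder labels and seeds, for illustration only.
TRAINED_RUNGS = ["frozen-probe", "LoRA", "full-FT"]
SEEDS = [0, 1, 2]

grid = run_grid(TRAINED_RUNGS, SEEDS, n_folds=5)
print(len(grid))  # 3 rungs x 3 seeds x 5 folds = 45
```

The same function reproduces the later revision recorded under CLAIM-017 (four rungs, three seeds, four folds, 48 runs) by changing the arguments.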

CLAIM-016 - ADR-016: Data design bundle — source slate, splits, HF pinning, dedup, ordering, ceilings, ref-scorer audit

Status: Accepted

Source: SPEC_GREENFIELD.md §1 Data rows 323-329 + Phase 0-02 walk

Closing commit/ADR: 8da7018

Claim

Phase 0-02 locks seven §1 Data decisions as a single prototype-scoped bundle: (Q1 row 323) source slate Path α retains the ADR-008-vetted positives (deepset plus Lakera-gandalf plus Lakera-mosscap plus HackAPrompt), benigns LMSYS-Chat-1M plus UltraChat, and OOD slate NotInject plus XSTest plus JBB-Behaviors plus BIPIA plus InjecAgent (HarmBench plus Tensor Trust plus LLMail-Inject deferred to afterword as named next-iteration extensions); (Q2 row 327) splits structure is LODO k=4 over positive sources times 3 seeds equals 12 observations per rung and 36 total trained runs across 3 ModernBERT-base conditions, with within-fold stratified k-fold deferred to afterword as the Fomin-2025-aligned variance-decomposition extension; (Q3 row 324) HF dataset revisions plus GitHub commit SHAs pinned at Phase 1 entry in unified data/source_manifest.yaml; manifest-documented bumps; ADR per bump only on schema change; (Q4 row 325) dedup encoder is all-MiniLM-L6-v2 cosine at threshold 0.80 with simplified Phase 1 calibration evidence on a 50-pair labeled holdout; MPNet-base-v2 plus full 4-gate selection rule deferred to afterword; (Q5 row 326) cross-source benign dedup order is within-source-first then cross-source with LMSYS-priority tiebreak applied before the LODO split per ADR-008 direction; (Q6 row 329) per-source ceilings are 3000 positives per source for mosscap and HackAPrompt and use-all for deepset plus Lakera-gandalf post-dedup, plus 10000 benigns per source for LMSYS plus UltraChat, with random subsample at seed equals 42; quality-filtered HackAPrompt plus attack-type-stratified plus length-stratified subsamples deferred to afterword; (Q7 row 328) reference-scorer audit pattern for partial disclosure is fold-pattern analysis plus stated-scope cross-check on both Lakera Guard and ProtectAI deberta-v3 reference rungs (option B), with cross-source same-style ablation (option C) deferred to afterword. All seven decisions cohere under the prototype framing: a clear claim shape with explicit future-work axes named in WRITEUP/limitations-and-future-work.md.

Acceptance criterion

SPEC_GREENFIELD ledger rows 323/324/325/326/327/328/329 carry locked-to-X (see ADR-016) status; SPEC_SHEET §3.1 source-slate table is populated with all 11 sources and per-source roles plus licenses; SPEC_SHEET §3.2 splits section carries the LODO k=4 plus 3-seed methodology lock; SPEC_SHEET §3.4 OOD slate table is populated with the 5 OOD sources and their roles; assumptions.md carries A-005 new (Phase 1 audit revisit triggers — benign contamination above 2 percent or class-balance outside 1-to-3-to-1-to-10 or per-source labeling-quality systematic mislabeling or length-distribution divergence from dossier estimates triggers superseding ADR); tests/test_invariants.py contains skip-marked stubs test_source_manifest_schema_valid and test_dedup_calibration_persisted and test_benign_contamination_scan_clean; data/source_manifest.yaml is the Phase 1 deliverable that pins all 11 source revisions and licenses plus per-source row counts; evals/dedup_calibration.json is the Phase 1 deliverable that persists per-source plus cross-source cosine distribution histograms plus FPR plus FNR at threshold 0.80 against a 50-pair labeled holdout; WRITEUP/limitations-and-future-work.md spoke contains a prioritized Next Iteration Priorities subsection enumerating the seven afterword extensions.
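Two of the data decisions above (cosine dedup at threshold 0.80 and seeded per-source ceilings) are sketched below. The sketch assumes embeddings are already computed and non-zero, and that rows arrive in priority order so the first member of a near-duplicate group is the one to keep; the greedy strategy and both function names are illustrative assumptions, not the project's dedup implementation.

```python
import numpy as np


def greedy_dedup(embeddings, threshold: float = 0.80) -> list[int]:
    """Keep a row only if its cosine similarity to every kept row is below threshold."""
    emb = np.asarray(embeddings, dtype=float)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize rows
    kept: list[int] = []
    for i in range(len(emb)):
        if not kept or float(np.max(emb[kept] @ emb[i])) < threshold:
            kept.append(i)
    return kept


def cap_source(rows, ceiling, seed: int = 42):
    """Random subsample down to `ceiling` rows; ceiling=None means use-all."""
    rows = list(rows)
    if ceiling is None or len(rows) <= ceiling:
        return rows
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(rows), size=ceiling, replace=False)
    return [rows[i] for i in sorted(idx)]
```

Fixing the seed makes the subsample reproducible across reruns, which matters because the capped rows feed every downstream fold.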

CLAIM-017 - ADR-017: Trained-rung-slate expansion — TF-IDF plus LR classical floor and frozen-probe dual role

Status: Accepted

Source: SPEC_GREENFIELD.md §2 Model ledger rows 331 + 332 + Phase 0-03 walk Q1 + Q1b

Closing commit/ADR: cfa7559

Claim

Phase 0-03 expands the trained-rung slate from the three-condition ModernBERT-base lineup locked by ADR-015 to a four-rung architecture by prepending a classical-NLP floor rung — TF-IDF plus Logistic Regression — that restores the SPEC_GREENFIELD §2 documented common-pattern default (linear floor then frozen-features probe then off-the-shelf classifier then adapter-fine-tuned transformer) that ADR-007 originally deviated from. The classical floor rung uses sklearn TfidfVectorizer with combined word 1-2-grams capped at 15000 features and char 3-5-grams capped at 15000 features (FeatureUnion stacked sparse matrix), feeding sklearn LogisticRegression with solver=liblinear plus C=1.0 plus class_weight=balanced plus max_iter=1000 — fit-to-convergence with sklearn default tolerance, no epoch concept. The frozen-probe role (ledger row 332) is locked as both candidate detector (appears in headline table alongside LoRA and full-FT with same metrics and same statistical machinery) and diagnostic anchor (the methodology spoke uses the lift-delta chain TF-IDF+LR then frozen-probe then LoRA then full-FT as a three-step capability decomposition narrative — pretrained transformer features beat classical features, adapter tuning adds something, full backbone fine-tuning adds something beyond adapters). The expansion is methodologically additive — it does not supersede ADR-015 because ADR-015 locked the ModernBERT-base single-backbone architecture for the transformer slate (still valid) — the classical-NLP rung occupies a separate spec axis (different feature space, different library stack — sklearn versus HF). Training-time scope (ledger row 331) is formally locked at frozen-probe plus LoRA plus full-FT (the three conditions enumerated by ADR-015 plus the classical-floor addition) — uniform across all four trained rungs for clean cross-rung lift attribution.

Acceptance criterion

SPEC_GREENFIELD ledger row 331 carries locked-to-frozen-probe-plus-LoRA-plus-full-FT (see ADR-017 complements ADR-015) status; ledger row 332 carries locked-to-both-candidate-detector-plus-diagnostic-anchor (see ADR-017) status; SPEC_SHEET §4 model recipe gains new §4.1 (Classical floor rung — TF-IDF plus LR specification) before the existing transformer-rung sections; SPEC_SHEET §4 enumerates the trained-rung slate as four rungs (TF-IDF+LR plus ModernBERT-base across frozen-probe plus LoRA plus full-FT); assumptions.md A-001 updated to reflect run count of 48 (four trained rungs times three seeds times four LODO folds) instead of 36 with the fallback ladder still applicable to the transformer rungs; tests/test_invariants.py contains skip-marked stub test_classical_floor_rung_present asserting the TF-IDF+LR rung is in the trained-rung config enumeration with sklearn TfidfVectorizer plus LogisticRegression(class_weight=balanced) plus combined word-1-2-grams plus char-3-5-grams plus per-vectorizer max_features 15000; tests/test_invariants.py existing stub test_trained_backbone_modernbert_only_invariant is reframed to assert trained transformer rungs (not all trained rungs) contain exactly ModernBERT-base across three conditions (carve out the classical floor rung); WRITEUP methodology spoke contains a dedicated lift-delta-chain subsection narrating the three-step capability decomposition (TF-IDF+LR then frozen-probe quantifies pretrained-transformer-feature contribution, frozen-probe then LoRA quantifies adapter-tuning contribution, LoRA then full-FT quantifies full-backbone-FT contribution).
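The classical floor rung described in this claim maps closely onto a scikit-learn pipeline, sketched below with the hyperparameters stated above. Two points are assumptions and not taken from the ADR: the char analyzer is set to "char" (the claim does not say whether "char" or "char_wb" is used), and the function name is hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline


def build_classical_floor() -> Pipeline:
    """TF-IDF (word 1-2-grams + char 3-5-grams) feeding logistic regression."""
    features = FeatureUnion([
        ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2),
                                 max_features=15000)),
        # Assumption: plain "char" analyzer; "char_wb" would also fit the claim.
        ("char", TfidfVectorizer(analyzer="char", ngram_range=(3, 5),
                                 max_features=15000)),
    ])
    classifier = LogisticRegression(
        solver="liblinear",
        C=1.0,
        class_weight="balanced",
        max_iter=1000,
    )
    return Pipeline([("features", features), ("clf", classifier)])
```

FeatureUnion stacks the two sparse matrices column-wise, so the classifier sees at most 30000 features; fitting is a single call to fit(texts, labels) with no epoch loop.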

CLAIM-018 - ADR-018: Reference scorer slate and contamination stratification — OpenAI plus Anthropic LLM-judges plus ProtectAI v1 and v2 plus per-axis matched-budget

Status: Accepted

Source: SPEC_GREENFIELD.md §2 Model ledger rows 333 + 334 + Phase 0-03 walk Q2 + Q3

Closing commit/ADR: cfa7559

Claim

Phase 0-03 locks the reference-rung slate at four rungs — one OpenAI LLM-judge (gpt-4o-2024-08-06 stable snapshot) plus one Anthropic LLM-judge (claude-sonnet-4-6 with date-suffixed snapshot ID pinned at Phase 1 entry per Anthropic API documentation snapshot suffix convention) plus two ProtectAI off-the-shelf classifiers (deberta-v3-base-prompt-injection v1 plus deberta-v3-base-prompt-injection-v2) — and partially supersedes ADR-015 reference-slate enumeration by dropping Lakera Guard (ToS verification overhead plus partial-disclosure complexity unacceptable for prototype scope, named in afterword extension) and adding ProtectAI v1 alongside v2 (provides internal v1-to-v2 lift comparison parallel to the trained-rung-lift narrative — what off-the-shelf classifier updates buy you). LLM-judge calls use temperature equals zero (deterministic; multi-seed irrelevant per ADR-007 line 50 framework preserved); prompt template versioned in repo; one call per eval row. The four reference rungs are stratified along ADR-005 three-state contamination taxonomy — TF-IDF+LR is verified_disjoint (trained on our LODO splits by construction, no possibility of pretrain contamination); ModernBERT-base across the three transformer conditions is backbone-partial-disjoint (fine-tuning disjoint by LODO construction; backbone pretrain corpus may include eval sources); ProtectAI v1 and v2 are suspected_contamination (partial training-corpus disclosure may include eval positives); gpt-4o and claude-sonnet-4-6 are vendor_black_box (closed corpora may include eval sources via web-scale pretraining). The trained-rung-vs-reference comparison is reported with this stratification explicit in a dedicated WRITEUP methodology spoke section so reviewer interpretation aligns — any trained-rung lift over LLM-judges is despite the LLM-judge pretrain advantage, and the TF-IDF+LR rung provides the only fully-disjoint anchor. 
Matched-budget controls (ledger row 333) lock to per-axis — data and eval methodology are matched (same train and eval splits per ADR-016; same metrics and statistical machinery per ADR-006); training compute is not matched (each rung uses its natural recipe; training compute is reported alongside the metric so AUPRC versus compute can be plotted as a Pareto frontier — the rung-ladder IS the Pareto frontier). Per-axis matching is the only framing that coherently handles the heterogeneous cost classes (LLM-judge dollars-per-call versus trained rungs GPU-minutes versus ProtectAI inference-only). A new assumption A-006 (severity medium) registers the contamination caveat and gates the methodology spoke section.

Acceptance criterion

SPEC_GREENFIELD ledger row 333 carries locked-to-per-axis-matched-data-and-eval-not-compute (see ADR-018) status; ledger row 334 carries locked-to-gpt-4o-2024-08-06-plus-claude-sonnet-4-6-plus-protectai-v1-plus-protectai-v2-drop-lakera (see ADR-018 partially supersedes ADR-015) status; SPEC_SHEET §4 reference-rung enumeration is updated to four rungs (gpt-4o-2024-08-06 plus claude-sonnet-4-6 plus ProtectAI v1 plus ProtectAI v2); SPEC_SHEET §4 gains a contamination-state column annotating each rung with the ADR-005 three-state taxonomy label (verified_disjoint or backbone-partial-disjoint or suspected_contamination or vendor_black_box); assumptions.md gains A-006 new (severity medium: all reference rungs carry uncontrolled training-data leakage relative to our eval slate; ProtectAI v1 plus v2 partial-disclosure may include eval positive sources; gpt-4o plus claude-sonnet-4-6 closed corpora may include eval sources via web-scale pretraining; the ModernBERT-base backbone pretrain corpus for trained rungs 2-4 may include eval sources; reporting consequence: every reference-rung headline metric is reported with an explicit contamination-state tag; methodology spoke includes contamination-stratification subsection; Phase 1 mitigation: a contamination scan via MiniLM cosine between the eval set and known public mirrors of training data provides partial evidence of overlap for ProtectAI but does not help for the LLM judges' closed corpora); data/source_manifest.yaml from ADR-016 is extended with a models section pinning HF revision SHAs for ProtectAI v1 plus v2 plus a judges section pinning LLM-judge snapshot IDs (gpt-4o-2024-08-06 plus claude-sonnet-4-6 with date-suffixed snapshot ID resolved at Phase 1 entry); EVIDENCE.md gains an entry per reference rung per the 3-state taxonomy with the contamination-state label and rationale; WRITEUP methodology spoke contains a dedicated Contamination stratification subsection narrating the four-tier disclosure gradient; WRITEUP methodology spoke contains a dedicated Matched-budget framing subsection narrating per-axis matching plus the Pareto frontier framing for compute reporting; LLM-judge prompt template is versioned in src/judges/prompt_template_v1.md and calls use temperature equals zero per the ADR-007 framework.

CLAIM-019 - ADR-019: LoRA and transformer-rung training recipe — hyperparameters, epochs, precision, class weighting

Status: Accepted

Source: SPEC_GREENFIELD.md §2 Model ledger rows 335 + 336 + 337 + Phase 0-03 walk Q4 + Q5 + Q6 + Q7

Closing commit/ADR: cfa7559

Claim

Phase 0-03 locks the complete training recipe for the three trained transformer rungs (ModernBERT-base across frozen-probe, LoRA, full-FT) as a single hyperparameter-immutability-compatible bundle. LoRA core hyperparameters use literature defaults from the LoRA paper and modern HF PEFT recipes — r equals 8 plus lora_alpha equals 16 (alpha equals 2r modern convention) plus lora_dropout equals 0.1 plus task_type SEQ_CLS plus bias none — with explicit deterministic target_modules enumeration listing four suffixes (Wqkv, attn.Wo, mlp.Wo, mlp.Wi) covering the fused QKV projection plus attention output projection plus both MLP projections per ModernBERT encoder layer (avoids dependency on PEFT all-linear auto-detection per SDD discipline). LoRA modules_to_save includes the classifier head (randomly-initialized; needs full fine-tuning alongside the LoRA adapters). Optimizer settings shared across the three transformer rungs — learning_rate equals 1e-4 plus warmup_ratio equals 0.10 plus lr_scheduler_type cosine plus per_device_train_batch_size equals 16 plus gradient_accumulation_steps equals 2 (effective batch 32 on H100 plus 80GB-VRAM equivalent classes; per ADR-020 the BATCH_TABLE scales per_device plus grad_accum on smaller GPU classes preserving effective batch 32) plus max_grad_norm equals 1.0 plus weight_decay equals 0.01 plus AdamW optimizer (HF Trainer default) plus DataCollatorWithPadding with max_length equals 8192 plus dynamic padding plus head-truncation when exceeded (per ADR-014 Q4 training-time policy). 
Epoch policy (ledger row 335) is locked at 2 epochs uniformly across all three transformer rungs (SPEC §2 default sweet spot; cross-rung uniformity eliminates the epoch-count confound when attributing lift between rungs) with per-epoch inference predictions persisted for every (rung, seed, fold) combination (epoch-2 is the headline number; epoch-1 is reported as a diagnostic ablation in the methodology spoke) — this converts SPEC §2 line 132 default suggestion of a 1-epoch-control on at-least-one-fold into a per-fold per-seed 1-epoch diagnostic at near-zero added cost. Intermediate full-FT weight checkpoints are not persisted to disk (approximately 150 megabytes per checkpoint times 12 training runs equals approximately 1.8 gigabytes of throwaway weights) — per-row inference predictions for epoch-1 are saved without persisting the underlying full-FT weights since predictions are the audit-relevant artifact for downstream analyses. LoRA plus frozen-probe intermediate adapter and head checkpoints are persisted (tiny; under 10 megabytes each). Precision policy (ledger row 336) is locked at bf16 for both training and inference with explicit fp32 cast before the final softmax or sigmoid (per SPEC §2 default; matches H100 native tensor-core throughput plus same dynamic range as fp32 avoiding gradient underflow concerns; ModernBERT was pretrained and tested with bf16 per Warner et al. 2024). Class-weight implementation (ledger row 337) is locked at sklearn-style class_weight balanced uniformly across all four trained rungs (TF-IDF+LR per ADR-017 Q1b lock plus the three transformer rungs via a WeightedTrainer subclass that overrides compute_loss with CrossEntropyLoss using a per-fold-recomputed compute_class_weight tensor) — uniform convention across the trained-rung slate enables clean cross-rung lift attribution and aligns with the sklearn convention already established by ADR-017 for the classical-floor rung. 
TrainingArguments configuration locks save_strategy equals epoch (saves per-epoch checkpoint for LoRA plus frozen-probe; full-FT intermediate checkpoints are deleted post-inference per the storage-management discipline above) plus eval_strategy equals no (no val-set-driven evaluation during training per SPEC §2 hyperparameter-immutability) plus seed slot iterated across the ADR-006 three-seed slate (42, 1337, 2025). Phase 1 verification task — after uv-resolving PEFT plus Transformers pinned versions, instantiate LoraConfig plus AutoModelForSequenceClassification.from_pretrained for answerdotai/ModernBERT-base wrapped with get_peft_model plus dump the resolved trainable parameter enumeration plus target module list to evals/lora_target_modules.json (asserts four LoRA layers per ModernBERT encoder block times 22 layers equals 88 LoRA adapter modules; classifier head is in modules_to_save; trainable parameter ratio approximately 0.5 to 1 percent of total).
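
The 88-module arithmetic in the Phase 1 verification task can be sanity-checked without loading a model. The sketch below mimics suffix matching over an assumed list of module names; the `model.layers.{i}...` naming pattern and the helper names are illustrative assumptions, and the authoritative enumeration remains whatever the pinned PEFT and Transformers versions write to evals/lora_target_modules.json.

```python
# Illustrative only: module names are assumed, not read from a real checkpoint.
TARGET_SUFFIXES = ["Wqkv", "attn.Wo", "mlp.Wo", "mlp.Wi"]
NUM_LAYERS = 22  # encoder blocks, per the claim text


def hypothetical_module_names(num_layers: int) -> list[str]:
    """Build an assumed module-name list for a ModernBERT-style encoder."""
    names = []
    for i in range(num_layers):
        prefix = f"model.layers.{i}"
        names += [
            f"{prefix}.attn.Wqkv",
            f"{prefix}.attn.Wo",
            f"{prefix}.mlp.Wi",
            f"{prefix}.mlp.Wo",
            f"{prefix}.attn_norm",  # not a LoRA target
        ]
    names.append("classifier")  # covered by modules_to_save, not by LoRA
    return names


def matches_target(name: str, suffixes: list[str]) -> bool:
    """True when the name equals a suffix or ends with '.' + suffix."""
    return any(name == s or name.endswith("." + s) for s in suffixes)


matched = [
    n for n in hypothetical_module_names(NUM_LAYERS)
    if matches_target(n, TARGET_SUFFIXES)
]
print(len(matched))  # four adapter sites per block times 22 blocks
```

Listing explicit suffixes keeps the matched set stable across library upgrades, which is the stated reason for avoiding all-linear auto-detection.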

Acceptance criterion

SPEC_GREENFIELD ledger row 335 carries locked-to-2-epochs-uniform-with-per-epoch-prediction-save (see ADR-019) status; ledger row 336 carries locked-to-bf16-training-plus-inference-with-fp32-softmax-cast (see ADR-019) status; ledger row 337 carries locked-to-sklearn-style-class-weight-balanced-uniform-across-trained-rungs (see ADR-019) status; SPEC_SHEET §4.2 through §4.4 transformer-rung subsections are populated with the locked recipe (LoRA core hyperparameters plus epoch policy plus precision plus class-weight); SPEC_SHEET §4 contains a Per-epoch prediction-save discipline subsection noting epoch-2 is headline plus epoch-1 is diagnostic; assumptions.md A-003 (pre-teardown persistence checklist per ADR-013) is extended in spirit to require per-epoch parquet predictions persisting before any pod teardown; tests/test_invariants.py contains skip-marked stub test_per_epoch_predictions_present asserting predictions exist for both epoch=1 and epoch=2 for every transformer (rung, seed, fold) combination per the 96-file enumeration; src/training/lora_config.py contains the locked LoraConfig instantiation matching the recipe; src/training/weighted_trainer.py contains the WeightedTrainer subclass with per-fold compute_class_weight; evals/lora_target_modules.json is the Phase 1 deliverable that captures the actual resolved LoRA target module enumeration plus trainable parameter count from the pinned PEFT and Transformers versions; src/inference/softmax_cast.py contains the fp32-cast-before-softmax helper.
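
For reference, the sklearn-style balanced convention weights each class by n_samples / (n_classes * class_count). A minimal stdlib sketch of the per-fold recomputation follows; the helper name and the toy fold are assumptions, and the real WeightedTrainer would pass the resulting values to CrossEntropyLoss as a tensor.

```python
from collections import Counter


def balanced_class_weights(labels: list[int]) -> dict[int, float]:
    """sklearn-style 'balanced' weights: n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        cls: n_samples / (n_classes * cnt)
        for cls, cnt in sorted(counts.items())
    }


# Toy training fold: 90 benign (0) rows and 10 attack (1) rows.
fold_labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(fold_labels)
print(weights)
```

Because the weights depend on the class counts of the training fold, they have to be recomputed for each LODO fold rather than fixed once.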

CLAIM-020 - ADR-020: Compute infrastructure and cost discipline — runpod-deploy 0.7.7 primitives, GPU failover, adaptive batch sizing, dual-layer cost tracking

Status: Accepted

Source: SPEC_GREENFIELD.md §2 Model ledger row 338 + SPEC_GREENFIELD §Tech-Stack + Phase 0-03 walk Q8

Closing commit/ADR: cfa7559

Claim

Phase 0-03 locks the compute infrastructure plus cost discipline by adopting runpod-deploy 0.7.7 primitives end-to-end rather than hand-rolling equivalents (per the library-first discipline from CLAUDE.md). GPU class lock (ledger row 338) is an eight-class failover ladder via pod.gpu_order in priority order — NVIDIA H100 80GB HBM3 plus NVIDIA H100 NVL plus NVIDIA H100 SXM plus NVIDIA H100 PCIe plus NVIDIA H200 plus NVIDIA H200 NVL (tier 1 — 80GB plus bf16 native plus flash-attention-2 native) then NVIDIA A100-SXM4-80GB plus NVIDIA A100 80GB PCIe (tier 2 — 80GB plus bf16 native plus flash-attention-2 native plus ~50 percent H100 throughput) then NVIDIA L40S (tier 3 — 48GB plus bf16 native plus flash-attention-2 with fallback) then NVIDIA A100-SXM4-40GB (emergency tier 4 — 40GB plus bf16 OK plus may need flash-attention fallback). Datacenter failover via pod.datacenters set to US-MD-1 plus EU-RO-1 (dual-DC capacity resilience). Adaptive batch sizing preserves the ADR-019 effective batch equals 32 invariant across all GPU classes by scaling per_device_train_batch_size plus gradient_accumulation_steps together through a pre-locked BATCH_TABLE lookup keyed on detected GPU class — H100 plus H200 plus A100-80G use per_device 16 with grad_accum 2 (matches ADR-019 H100 default); A100-40G plus L40S use per_device 8 with grad_accum 4; L40 uses per_device 4 with grad_accum 8. This is NOT hyperparameter tuning (the effective batch — the actual gradient-computation hyperparameter — is held constant at 32 across all GPU classes); per_device plus grad_accum are throughput knobs that do not change the gradient computation; preserves SPEC §2 hyperparameter-immutability invariant. 
flash_attention_2 fallback follows the runpod-deploy flash-attention-fallback recipe: model load wraps AutoModelForSequenceClassification.from_pretrained in try/except (ValueError, ImportError), catching the unsupported case on smaller GPU classes and degrading to stock SDPA without failing; the per-run manifest logs which attn_impl was used so the audit trail is preserved (events.emit_event flash_attn_fallback when degraded). Cost cap is dual-layer: runpod-deploy budget.cost_cap_usd equals 125.0 enforced per-job by the orchestrator (matches the A-002 upper-bound soft cap and envelope; one bad pod cannot exceed this without orchestrator-level intervention) plus a project-wide hard cap of 200 dollars enforced by the scripts/cost_rollup.py CI-gated check aggregating across all per-pod runpod_deploy_pull_manifest.json files plus API call logs (LLM-judge gpt-4o plus claude-sonnet-4-6 spend tracked separately since API costs bypass runpod-deploy). assumed_hourly_rate_usd equals 3.50 is set as the H100 spot midpoint estimate (per the runpod-deploy cost-reconciliation recipe, reconciled after the first run by comparing manifest gpu_price_per_hour_usd to the assumed rate; if the actual rate differs materially, the rate is bumped in subsequent config or split per GPU class). Preflight discipline mandates runpod-deploy validate --all (config schema plus DC reachability plus GPU stock check) plus runpod-deploy run --dry-run (cost preview without provisioning) before any billed run. Cost tracking is dual-layer: per-pod automatic via runpod_deploy_pull_manifest.json (captures wall_time_sec plus gpu_id plus gpu_price_per_hour_usd plus gpu_price_source plus estimated_cost_usd plus pod_final_state automatically; no code required) plus per-Makefile-target rolled up via evals/cost_ledger.csv (scripts/cost_rollup.py walks artifacts/runpod/star/runpod_deploy_pull_manifest.json plus API call logs and emits a timestamped per-target row with est_cost_usd plus actual_cost_usd plus gpu_hours plus api_calls plus notes). 
The soft-cap trigger at 80 dollars cumulative spend flags the project for review before the next major run; a soft-cap breach at 125 dollars cumulative requires an escalation discussion documented in evals/cost_decisions.md before further spend; a hard-cap breach at 200 dollars cumulative requires a superseding ADR documenting the extension rationale before any further GPU or API spend.
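
The three spend thresholds can be read as a simple tier lookup. The sketch below is one possible encoding; the function name, the tier labels, and the choice to treat each boundary as inclusive are assumptions, not the behavior of scripts/cost_rollup.py.

```python
def spend_status(cumulative_usd: float) -> str:
    """Map cumulative spend (USD) to an escalation tier.

    Boundaries are treated as inclusive here, which is an assumption.
    """
    if cumulative_usd >= 200.0:
        return "hard_cap_breach"  # superseding ADR required
    if cumulative_usd >= 125.0:
        return "soft_cap_breach"  # escalation note in evals/cost_decisions.md
    if cumulative_usd >= 80.0:
        return "review_flag"      # review before the next major run
    return "ok"


for spend in (42.0, 80.0, 130.0, 200.0):
    print(spend, spend_status(spend))
```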

Acceptance criterion

SPEC_GREENFIELD ledger row 338 carries locked-to-runpod-deploy-0.7.7-gpu-failover-plus-adaptive-batch-plus-dual-layer-cost-tracking (see ADR-020) status; SPEC_SHEET §4 compute section is populated with the 8-class gpu_order plus dual-DC failover plus the BATCH_TABLE specification plus the flash-attn-2 fallback policy plus the dual-layer cost-cap policy; configs/runpod/headline.yaml is the Phase 1 deliverable carrying the locked pod.gpu_order plus pod.datacenters plus budget.cost_cap_usd plus budget.assumed_hourly_rate_usd plus budget.poll_interval_sec configuration; src/training/batch_table.py is the Phase 1 deliverable implementing the pre-locked BATCH_TABLE keyed on torch.cuda.get_device_name with explicit KeyError handling (fails loudly with an “add this GPU class to BATCH_TABLE” message rather than silently defaulting); src/training/load_modernbert.py implements the flash-attn-2 fallback recipe via try/except (ValueError, ImportError); scripts/cost_rollup.py is the Phase 1 deliverable aggregating per-pod manifests plus API call logs into evals/cost_ledger.csv with CI hard-gate on cumulative spend above 200 dollars hard-cap; tests/test_invariants.py contains skip-marked stubs test_flash_attn_fallback_present and test_effective_batch_constant_across_gpu_classes; decisions/library_imports.md runpod-deploy section is populated with the eight primitives invoked (runpod-deploy validate --all plus runpod-deploy run --dry-run plus runpod-deploy run plus runpod-deploy logs plus runpod-deploy stop plus runpod-deploy manifest-summary plus the pod.gpu_order schema plus the budget.cost_cap_usd schema plus the preflight.check_gpu_availability internal primitive plus the flash-attn-fallback recipe pattern plus the cost-reconciliation recipe pattern); evals/cost_ledger.csv schema documented with the seven columns (timestamp plus target plus est_cost_usd plus actual_cost_usd plus gpu_hours plus api_calls plus notes).
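
A minimal sketch of what a BATCH_TABLE lookup with loud failure might look like. The key strings reuse the gpu_order labels and may not match what torch.cuda.get_device_name actually returns on a given pod; "NVIDIA L40" is an assumed spelling, and `batch_settings` is a hypothetical helper name.

```python
EFFECTIVE_BATCH = 32

# device name -> (per_device_train_batch_size, gradient_accumulation_steps)
BATCH_TABLE: dict[str, tuple[int, int]] = {
    "NVIDIA H100 80GB HBM3": (16, 2),
    "NVIDIA H100 NVL": (16, 2),
    "NVIDIA H100 SXM": (16, 2),
    "NVIDIA H100 PCIe": (16, 2),
    "NVIDIA H200": (16, 2),
    "NVIDIA H200 NVL": (16, 2),
    "NVIDIA A100-SXM4-80GB": (16, 2),
    "NVIDIA A100 80GB PCIe": (16, 2),
    "NVIDIA L40S": (8, 4),
    "NVIDIA A100-SXM4-40GB": (8, 4),
    "NVIDIA L40": (4, 8),  # assumed name string
}


def batch_settings(device_name: str) -> tuple[int, int]:
    """Return (per_device, grad_accum); fail loudly on unknown GPUs."""
    try:
        return BATCH_TABLE[device_name]
    except KeyError:
        raise KeyError(
            f"Unknown GPU class {device_name!r}: "
            "add this GPU class to BATCH_TABLE"
        ) from None


# Invariant: every row preserves the effective batch of 32.
assert all(pd * ga == EFFECTIVE_BATCH for pd, ga in BATCH_TABLE.values())
print(batch_settings("NVIDIA L40S"))
```

Raising instead of defaulting means an unlisted GPU class stops the run before any gradient step is taken with an unverified batch configuration.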

CLAIM-021 - ADR-021: Eval slate aggregation — pooled headline plus per-slice spoke and recall@FPR pinpoint feasibility

Status: Accepted

Source: SPEC_GREENFIELD.md §3 Eval ledger rows 339 and 342 + Phase 0-04 walk Q1 + Q4

Closing commit/ADR: b750d1d

Claim

Phase 0-04 locks the per-rung reporting layout for the §3 Eval framework as follows. (1) OOD slate aggregation — the 5 OOD slices locked by ADR-016 (NotInject plus XSTest plus JBB-Behaviors plus BIPIA plus InjecAgent) are reported in two complementary aggregation views. The PDF executive headline table carries a single pooled-OOD column per rung (concatenated rows across the 5 slices, single AUPRC plus AUROC plus recall@FPR plus ECE plus Brier per rung). The methodology spoke at WRITEUP/ood-analysis.md (new) carries the 5-rung-by-slice grid with per-slice bootstrap CIs computed on the same persisted predictions — no extra compute. Pooled-and-per-slice reporting is the hub-and-spoke ADR-004 framing applied to OOD — pooled for A1 (hiring manager exec scan) plus per-slice for A2 (ML researcher generalization-question-by-question read). (2) Recall@FPR pinpoints — the {0.1 percent, 1 percent, 5 percent} triad pre-locked by ADR-006 is ratified at the ledger-row level with one operational refinement — the 0.1 percent pinpoint is computed and reported at the pooled aggregation level only (where benign sample size of approximately 16 thousand to 20 thousand pooled rows yields a meaningful bootstrap CI), not at per-slice or per-LODO-fold aggregation levels (where benign sample size of approximately 200 to 1054 rows reduces the operating-point threshold to 0 to 1 false-positive count and the recall value becomes undefined or ill-conditioned). (3) Bootstrap volatility surfacing — for the 0.1 percent pinpoint specifically, the reporting discipline pre-commits four surfaces — half-width column alongside point estimate; flag marker when half-width exceeds 0.5 times point estimate; resample-degeneracy audit (fraction of resamples that pinned at less than 1 false-positive count); per-resample threshold-drift dump to evals/audit/pinpoint_threshold_drift.json. 
The methodology spoke explains why the 0.1 percent pinpoint is reported with wider CIs and is not computable per-slice, and references the larger-scale PromptShield 2024-2025 precedent for context.
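
The pooled-only restriction on the 0.1 percent pinpoint follows from simple counting: the number of false positives the threshold may admit is the floor of the target FPR times the benign count. The sketch below works this out for the sample sizes quoted above; the helper name is hypothetical.

```python
from fractions import Fraction
from math import floor


def allowed_false_positives(n_benign: int, target_fpr: Fraction) -> int:
    """Largest false-positive count that keeps FPR at or below the target."""
    return floor(n_benign * target_fpr)


FPR_0_1_PCT = Fraction(1, 1000)  # exact arithmetic avoids float rounding

for n_benign in (200, 1054, 16_000, 20_000):
    print(n_benign, allowed_false_positives(n_benign, FPR_0_1_PCT))
```

At 200 to 1054 benign rows the budget is zero or one false positive, so the operating point is pinned by a single row; at 16 thousand to 20 thousand pooled rows the budget is 16 to 20, which leaves room for a bootstrap CI to move.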

Acceptance criterion

SPEC_GREENFIELD ledger row 339 carries locked-to-pooled-headline-plus-per-slice-spoke status; ledger row 342 carries locked-to-0.1pct-pooled-only-plus-volatility-surfacing status; SPEC_SHEET §3.4 OOD slate adds an aggregation-layout subsection naming pooled-headline + per-slice-spoke; SPEC_SHEET §5.1 primary descriptive metrics replaces [OPEN] on recall@FPR pinpoints with [LOCKED — {0.1pct pooled-only, 1pct, 5pct} per ADR-021]; WRITEUP/ood-analysis.md spoke filename pre-committed; WRITEUP/methodology.md gains a “Volatility surfacing at low-FPR pinpoints” subsection; tests/test_invariants.py contains skip-marked stub test_ood_aggregation_layout asserting the 5-slice configuration plus pooled-concatenation pipeline plus per-slice bootstrap CI computation pattern; tests/test_invariants.py contains skip-marked stub test_recall_at_fpr_pinpoint_volatility asserting the bootstrap-volatility-flag pipeline emits half-width + degeneracy-fraction columns for the 0.1 percent pinpoint at pooled level.

CLAIM-022 - ADR-022: Statistical inference apparatus — bootstrap N + stability check, multi-comparison stance, multi-seed protocol, paired-test method

Status: Accepted

Source: SPEC_GREENFIELD.md §3 Eval ledger rows 340 + 341 + 344 + 345 + Phase 0-04 walk Q2 + Q3 + Q6 + Q7

Closing commit/ADR: b750d1d

Claim

Phase 0-04 formalizes the statistical inference apparatus at the ledger-row level by ratifying and extending ADR-006’s brief-level pre-locks across four §3 Eval rows. (1) Bootstrap N plus stability check (row 340) — 10K iterations at seed=1 (BCa for marginals via eval_toolkit.bootstrap_ci, percentile for paired-Delta-CIs via eval_toolkit.paired_bootstrap_diff) as headline; 10K iterations at seed=2 as stability check; flag in audit JSON when the stability-check CI half-width differs from the headline CI half-width by more than 5 percent (signals resampling instability requiring escalation to 100K iterations or honest reporting of wider CI); parallelize across independent CI computations via joblib.Parallel(n_jobs=-1) at the orchestrator layer (library-first discipline preserved — primitive itself stays as eval-toolkit shipped) on the 64-core Threadripper CPU (approximately 10000 independent CIs across 84 trained plus 16 reference prediction parquets times 4 headline metrics times 3 recall@FPR pinpoints times 6 slice aggregations times approximately 28 rung-vs-rung pairs); upstream issue filed against eval-toolkit proposing optional n_jobs parameter on paired_bootstrap_diff for internal resample-loop parallelization. (2) Multi-comparison correction (row 341) — no formal correction applied per ADR-006 estimation-over-testing stance ratified; methodology spoke at WRITEUP/methodology.md gains an explicit “family of comparisons” acknowledgment paragraph citing Gelman and Loken 2014 garden-of-forking-paths plus ASA 2016 statement on p-values; pre-empts A2 reviewer concern about ~28 paired comparisons without re-importing significance-testing apparatus. 
(3) Multi-seed protocol details (row 344) — 3 seeds {42, 1337, 2025} per ADR-006 ratified; trained rungs have 12 (fold, seed) observations per rung (4 LODO folds times 3 seeds) per ADR-016; reference rungs have 4 (fold) observations per rung (no seed dimension, inference-only); paired-across-rungs implementation uses (a) row-level pairing for trained-vs-trained comparisons via eval-toolkit paired_bootstrap_diff, (b) per-row replication of reference scores across the 12 trained seeds for trained-vs-reference comparisons (reference-side variance correctly fold-only); rank-based metrics aggregate per-(fold, seed)-then-mean (12 values per rung yielding cross-fold CI per ADR-024); per-row metrics (ECE plus Brier) pool rows across (fold, seed) and compute once per rung; recall@FPR thresholds computed per-(seed) from val per-(rung, fold) and applied to test (12 thresholds per rung yielding 12 recall values per rung); calibration interventions fit per-(rung, fold, seed) yielding 12 calibrators per rung times 2 interventions per ADR-023; per-(rung, fold, seed) observations dumped to evals/audit/per_seed_observations.parquet per ADR-011 Guarantee 5; methodology spoke gains variance-attribution subsection decomposing per-rung variance into fold-to-fold plus seed-to-seed plus within-(fold, seed)-bootstrap noise. (4) Paired-test method (row 345) — eval-toolkit paired_bootstrap_diff (Efron-Tibshirani 1993 §10.3 row-level pairing) ratified; DeLong 1988 plus McNemar plus Cochran-Q rejected at the row level with multi-source-LODO-specific rationale (DeLong’s asymptotic Gaussian assumption breaks at our per-fold scale of approximately 4000 to 5000 benigns; designed for AUROC only not AUPRC or ECE; produces p-value contradicting estimation-over-testing; McNemar requires fixed-threshold commitment contradicting ADR-006 scenario-based threshold framing; Cochran-Q designed for fixed-classifier-vs-many-datasets — inverse of our setting). 
LLM-judge non-determinism at temperature=0 is surfaced as new assumption A-007 (reference-rung scores are cached at first call and re-run only on cache miss; inter-call variance is not measured).
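
Row-level pairing means both rungs are scored on the same resampled rows in every bootstrap iteration, so the difference absorbs shared row-level noise. The following stdlib sketch shows the idea with a percentile interval; it is not the eval-toolkit implementation, and `paired_percentile_diff_ci`, `brier`, and the toy data are all hypothetical.

```python
import random
from statistics import mean


def brier(y: list[int], p: list[float]) -> float:
    """Mean squared error between labels and predicted probabilities."""
    return mean((pi - yi) ** 2 for yi, pi in zip(y, p))


def paired_percentile_diff_ci(y, scores_a, scores_b, metric,
                              n_boot=2000, seed=1, alpha=0.05):
    """Percentile CI for metric(A) - metric(B) under row-level pairing."""
    rng = random.Random(seed)
    n = len(y)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # same rows for both rungs
        y_b = [y[i] for i in idx]
        a_b = [scores_a[i] for i in idx]
        b_b = [scores_b[i] for i in idx]
        diffs.append(metric(y_b, a_b) - metric(y_b, b_b))
    diffs.sort()
    k = round(alpha / 2 * n_boot)
    return diffs[k], diffs[n_boot - 1 - k]


# Toy data: rung A is perfect, rung B always predicts 0.5.
y = [0, 1] * 50
rung_a = [float(v) for v in y]
rung_b = [0.5] * len(y)
print(paired_percentile_diff_ci(y, rung_a, rung_b, brier, n_boot=200))
```

For trained-vs-reference comparisons, the same routine applies once the reference scores are replicated per row across the trained seeds, as described in item (3).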

Acceptance criterion

SPEC_GREENFIELD ledger row 340 carries locked-to-10K-bootstrap-with-second-seed-stability-check status; ledger row 341 carries locked-to-no-formal-correction-with-Gelman-Loken-acknowledgment-paragraph status; ledger row 344 carries locked-to-3-seeds-paired-across-rungs-with-gap-honest-defaults status; ledger row 345 carries locked-to-paired-bootstrap-diff-with-multi-source-LODO-rejection-rationale status; SPEC_SHEET §5.2 statistical tests expanded with explicit listing of headline-bootstrap-iteration-counts plus stability-check-protocol plus multi-comparison-acknowledgment plus paired-across-rungs-implementation-details; assumptions.md gains A-007 documenting LLM-judge non-determinism reference treatment; decisions/library_imports.md eval-toolkit section populated with bootstrap_ci + paired_bootstrap_diff + paired_bootstrap_ece_diff + cross_validate_metric primitives; decisions/upstream_issues.md gains entry for paired_bootstrap_diff parallelization proposal; tests/test_invariants.py contains skip-marked stub test_bootstrap_n_and_stability_check asserting 10K-plus-10K-at-second-seed pipeline plus 5pct-half-width-flag emission; tests/test_invariants.py contains skip-marked stub test_paired_across_rungs_pairing asserting (a) trained-vs-trained row-level pairing plus (b) trained-vs-reference per-row replication pattern.
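
The stability check compares CI half-widths from the two bootstrap seeds. A minimal sketch of the 5 percent rule follows, assuming the difference is measured relative to the headline half-width; the function name is hypothetical.

```python
def stability_flag(
    headline_ci: tuple[float, float],
    check_ci: tuple[float, float],
    tolerance: float = 0.05,
) -> bool:
    """True when the two half-widths differ by more than the tolerance."""
    hw_headline = (headline_ci[1] - headline_ci[0]) / 2
    hw_check = (check_ci[1] - check_ci[0]) / 2
    if hw_headline == 0:
        # Degenerate headline CI: any nonzero check width is a mismatch.
        return hw_check != 0
    return abs(hw_check - hw_headline) / hw_headline > tolerance


print(stability_flag((0.80, 0.90), (0.80, 0.902)))  # about 2 percent apart
print(stability_flag((0.80, 0.90), (0.78, 0.90)))   # about 20 percent apart
```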

CLAIM-023 - ADR-023: Calibration battery composition — raw plus temperature plus isotonic interventions

Status: Accepted

Source: SPEC_GREENFIELD.md §3 Eval ledger row 343 + Phase 0-04 walk Q5

Closing commit/ADR: b750d1d

Claim

Phase 0-04 locks the calibration battery composition at row 343 as Option C revised — raw scores reported plus two validation-fit calibration interventions (temperature scaling and isotonic regression). The PDF executive headline table carries two calibration columns per rung — ECE-equal-mass (n_bins=15, quantile binning) and Brier score — both computed on the raw model outputs (no intervention applied). The methodology spoke at WRITEUP/calibration.md (new) carries the full battery — all four ECE variants from eval-toolkit (L1 plug-in via expected_calibration_error, L1 debiased via expected_calibration_error_debiased, L2 plug-in via expected_calibration_error_l2, L2 debiased via expected_calibration_error_l2_debiased), Brier score plus Brier decomposition into refinement-reliability-uncertainty components, reliability diagrams (equal-mass quantile binning per eval-toolkit recommendation under class imbalance), and intervention-delta tables showing how much miscalibration is correctable by (a) Guo 2017 1-parameter temperature scaling and (b) sklearn IsotonicRegression-wrapped fit_isotonic_calibrator non-parametric monotonic remapping. Both calibrators fit on the validation split per-(rung, fold, seed) yielding 12 calibrators per trained rung times 2 interventions; applied to the test split; ECE plus Brier re-computed on the calibrated scores. Per ADR-011 Guarantee 6, calibration fitting occurs on validation only — no test-set leakage. Per ADR-022’s multi-seed protocol, calibration interventions on the 4 reference rungs (LLM judges and ProtectAI v1/v2) fit per-(rung, fold) yielding 4 calibrators per reference rung times 2 interventions (reference rungs have no seed dimension). 
Calibration interventions are explicitly methodology-axis, not deployment-axis: the temperature-vs-isotonic gap is the methodology-informative quantity (small gap means simple scaling captures most miscalibration; large gap means miscalibration has non-temperature-monotone structure recoverable only by non-parametric remapping). Calibration interventions are monotonic by construction and therefore do NOT change rank-based headline metrics (PR-AUC, ROC-AUC, recall@FPR); this methodology subtlety is noted in the spoke. The preservation is exact for temperature scaling (strictly monotonic for positive temperature); isotonic regression is only monotone non-decreasing and can map distinct scores to one value, so its rank-based metrics can shift slightly through ties. Platt scaling plus beta calibration are deferred to afterword; maximum-calibration-error (worst-bin) is computed and dumped to audit JSON but not reported in headline or spoke; per-slice calibration after intervention is deferred unless a reviewer asks (per-slice n too small for stable temperature fits).
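
A small check of the rank-preservation point for temperature scaling: dividing logits by a positive temperature before the sigmoid is strictly increasing, so the ordering of rows cannot change. The helper names and toy logits below are assumptions for illustration.

```python
import math


def apply_temperature(logits: list[float], temperature: float) -> list[float]:
    """Binary temperature scaling: sigmoid(logit / T) with T > 0."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    return [1.0 / (1.0 + math.exp(-z / temperature)) for z in logits]


def rank_order(values: list[float]) -> list[int]:
    """Indices of the values sorted from lowest to highest."""
    return sorted(range(len(values)), key=values.__getitem__)


logits = [-2.0, 0.3, 1.5, -0.7, 4.0]
raw = apply_temperature(logits, 1.0)
cooled = apply_temperature(logits, 2.5)
print(rank_order(raw) == rank_order(cooled))
```

The probabilities themselves move toward 0.5 when the temperature exceeds 1, which is why ECE and Brier change while ordering-based metrics do not.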

Acceptance criterion

SPEC_GREENFIELD ledger row 343 carries locked-to-raw-plus-temperature-plus-isotonic-with-ECE-equal-mass-and-Brier-headline-plus-full-spoke-battery status; SPEC_SHEET §5.1 primary descriptive metrics replaces [OPEN] on calibration battery with [LOCKED — ECE-equal-mass(n_bins=15, quantile) plus Brier headline; full 4-ECE plus Brier-decomp plus reliability diagrams plus temperature plus isotonic intervention deltas in WRITEUP/calibration.md spoke per ADR-023]; WRITEUP/calibration.md spoke filename pre-committed; decisions/library_imports.md eval-toolkit section populated with all 4 ECE variants plus brier_score plus brier_decomposition plus reliability_curve plus fit_temperature plus fit_isotonic_calibrator primitives; tests/test_invariants.py contains skip-marked stub test_calibration_battery_composition asserting (1) headline emit contains ECE-equal-mass plus Brier per rung on raw scores; (2) spoke artifact contains all 4 ECE variants plus temperature-applied plus isotonic-applied deltas; (3) calibrator fits use only validation data per ADR-011 Guarantee 6 (no test-set rows in calibrator-fit input); tests/test_invariants.py contains skip-marked stub test_monotonic_intervention_preserves_ranks asserting PR-AUC after temperature equals PR-AUC before temperature within numerical tolerance (sanity check that the intervention machinery is monotonic).
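
For orientation, an equal-mass (quantile-binned) L1 plug-in ECE can be sketched in a few lines. This is a binary positive-class variant written from the general definition, not the eval-toolkit implementation; `ece_equal_mass` is a hypothetical name and ties at bin edges are split by input order.

```python
def ece_equal_mass(y_true: list[int], probs: list[float],
                   n_bins: int = 15) -> float:
    """L1 plug-in ECE with bins holding (nearly) equal numbers of rows."""
    pairs = sorted(zip(probs, y_true), key=lambda t: t[0])
    n = len(pairs)
    ece = 0.0
    for b in range(n_bins):
        start = b * n // n_bins
        stop = (b + 1) * n // n_bins
        if stop == start:
            continue  # fewer rows than bins: skip empty bins
        chunk = pairs[start:stop]
        confidence = sum(p for p, _ in chunk) / len(chunk)
        frequency = sum(y for _, y in chunk) / len(chunk)
        ece += (len(chunk) / n) * abs(confidence - frequency)
    return ece


labels = [0] * 15 + [1] * 15
print(ece_equal_mass(labels, [float(v) for v in labels]))  # perfectly sharp
print(ece_equal_mass([0] * 30, [1.0] * 30))                # maximally wrong
```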

CLAIM-024 - ADR-024: Cross-fold CI methodology — cv_clt_ci (Bayle 2020) headline plus block-bootstrap-on-folds spoke

Status: Accepted

Source: SPEC_GREENFIELD.md §3 Eval ledger row 346 + Phase 0-04 walk Q8

Closing commit/ADR: b750d1d

Claim

Phase 0-04 locks the cross-fold confidence-interval methodology at row 346 as Option F — hybrid library-first headline plus methodology-honest spoke ablation plus a conditional escalation path. (1) Headline CI machinery — eval_toolkit.cv_clt_ci (Bayle, Bayle, Janson and Mackey 2020 Annals of Statistics Theorem 3.1 implementation in eval-toolkit at src/eval_toolkit/bootstrap.py:963) on the 12 (fold, seed) per-rung metric values from ADR-022’s compute-per-(fold, seed)-then-aggregate rule. The cv_clt_ci primitive computes normal-approximation CI with sample variance (K-1)-denominator estimator at K=12 entries (4 folds times 3 seeds); BCa-replaced-by-normal-approx is appropriate at this K. (2) Spoke ablation — block bootstrap on (fold) blocks — resample 4 folds with replacement (with seeds inside each block as units); per-resample compute mean-of-fold-metrics; 10K resamples; percentile CI on the resampled fold-mean. Computed via custom orchestrator on top of eval_toolkit.bootstrap_ci primitive (resample-indices construction is project-specific glue; primitive is library-shipped — library-first discipline preserved). The block-bootstrap CI is reported alongside the cv_clt_ci CI in WRITEUP/methodology.md spoke as a sensitivity check addressing the LODO non-exchangeability concern (LODO folds are explicitly not exchangeable per ADR-016 design — each fold holds out a different positive source with different size and attack-style character; cv_clt_ci was derived for exchangeable k-fold). (3) Sensitivity-check threshold — if block_bootstrap_CI_halfwidth / cv_clt_CI_halfwidth exceeds 1.5 for any headline rung, flag in the spoke as “LODO non-exchangeability dominates within-fold variance; headline CI may understate uncertainty” — this becomes a named methodology finding rather than a hidden numerical caveat. 
(4) Conditional escalation — stratified-k-fold-within-LODO (Fomin 2025 + Nadeau-Bengio 2003 variance decomposition; approximately 5x compute relative to plain LODO) is pre-committed as a conditional escalation IF the project’s cost ledger at Phase 4 entry shows cumulative spend well below the ADR-020 per-job soft cap of 125 USD (rough threshold — under 75 USD cumulative); ELSE deferred to afterword. (5) Bates 2024 JASA nested-CV and Nadeau-Bengio 2003 standalone correction factor explicitly deferred to afterword — both require custom implementations not in eval-toolkit; their theoretical advantages do not outweigh the library-first plus 2.5-day-timeline cost.
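
The headline interval in item (1) is described as a normal approximation over the 12 per-(fold, seed) values with a (K-1)-denominator variance. The sketch below follows that description only; it is not taken from the eval-toolkit source, the function name is hypothetical, and the twelve values are made-up toy numbers.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def normal_approx_ci(values: list[float],
                     confidence: float = 0.95) -> tuple[float, float]:
    """Mean plus or minus z * s / sqrt(K), with s using the (K-1) denominator."""
    k = len(values)
    center = mean(values)
    standard_error = stdev(values) / sqrt(k)  # stdev divides by K-1
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return center - z * standard_error, center + z * standard_error


# Toy per-(fold, seed) metric values: 4 folds times 3 seeds.
toy_values = [0.81, 0.83, 0.82, 0.74, 0.76, 0.75,
              0.88, 0.87, 0.89, 0.79, 0.80, 0.78]
print(normal_approx_ci(toy_values))
```

Because the toy values cluster by fold, most of the spread here is fold-to-fold, which is the situation the block-bootstrap sensitivity check in item (2) is meant to probe.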

Acceptance criterion

SPEC_GREENFIELD ledger row 346 carries locked-to-cv_clt_ci-headline-plus-block-bootstrap-on-folds-spoke-plus-conditional-stratified-k-fold status; SPEC_SHEET §5.2 cross-fold CI methodology replaces [OPEN] with [LOCKED — cv_clt_ci (Bayle 2020) headline plus block-bootstrap-on-folds spoke ablation plus conditional stratified-k-fold-within-LODO escalation if Phase 4 compute budget permits per ADR-024]; decisions/library_imports.md eval-toolkit section gains cv_clt_ci primitive entry; WRITEUP/methodology.md gains a “Cross-fold CI methodology and LODO non-exchangeability” subsection containing both CI values per rung plus the sensitivity-check flag; tests/test_invariants.py contains skip-marked stub test_cross_fold_ci_methodology asserting (1) cv_clt_ci primitive invoked on 12 per-(fold, seed) values per rung; (2) block-bootstrap-on-folds orchestrator produces percentile CI on 10K resamples of (fold) blocks; (3) sensitivity-check flag emits when block_bootstrap_halfwidth / cv_clt_halfwidth exceeds 1.5; (4) conditional-stratified-k-fold-within-LODO escalation gated on cost-ledger evals/cost_ledger.csv state at Phase 4 entry; assumptions.md A-008 added (severity medium) documenting LODO-non-exchangeability concern as load-bearing on cross-fold CI validity.
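
A sketch of the block-bootstrap-on-folds ablation and the 1.5 ratio flag asserted by test_cross_fold_ci_methodology. Folds are resampled as whole blocks with their seed values inside; the function names, the fold labels, and the toy values are assumptions, and the real orchestrator builds on eval_toolkit.bootstrap_ci instead.

```python
import random
from statistics import mean


def fold_block_bootstrap_ci(values_by_fold: dict[str, list[float]],
                            n_boot: int = 10_000, seed: int = 1,
                            alpha: float = 0.05) -> tuple[float, float]:
    """Percentile CI on the mean of fold means, resampling whole folds."""
    rng = random.Random(seed)
    folds = list(values_by_fold)
    fold_means = {f: mean(values_by_fold[f]) for f in folds}
    stats = []
    for _ in range(n_boot):
        draw = [rng.choice(folds) for _ in folds]  # folds with replacement
        stats.append(mean(fold_means[f] for f in draw))
    stats.sort()
    k = round(alpha / 2 * n_boot)
    return stats[k], stats[n_boot - 1 - k]


def nonexchangeability_flag(block_ci, clt_ci, ratio_threshold=1.5) -> bool:
    """True when the block-bootstrap half-width dominates the CLT one."""
    block_hw = (block_ci[1] - block_ci[0]) / 2
    clt_hw = (clt_ci[1] - clt_ci[0]) / 2
    return clt_hw > 0 and block_hw / clt_hw > ratio_threshold


toy = {"fold_a": [0.81, 0.83, 0.82], "fold_b": [0.74, 0.76, 0.75],
       "fold_c": [0.88, 0.87, 0.89], "fold_d": [0.79, 0.80, 0.78]}
print(fold_block_bootstrap_ci(toy, n_boot=1000))
```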

CLAIM-025 - ADR-025: Dual-policy threshold characterization at symmetric 1% cost weights

Status: Accepted

Source: SPEC_GREENFIELD.md §4 Threshold ledger row 347 + Phase 0-05 walk Q1 + Q2 + Q3 + Q4

Closing commit/ADR: e335739

Claim

Phase 0-05 locks the §4 Threshold ledger row 347 (cost-weight targets) at the per-rung operational level as follows. (1) Numeric cost-weight targets: symmetric 1% on both policy budgets. Detection policy targets FPR ≤ 1% on validation via eval_toolkit.TargetFPRSelector(0.01); Verification policy targets FNR ≤ 1% (equivalently recall ≥ 99%) on validation via eval_toolkit.TargetRecallSelector(0.99). The detection-policy operating point numerically coincides with the recall@FPR=1% headline pinpoint already locked in ADR-021, so the dual-policy framing relabels the existing column with a footnote rather than introducing a new one. (2) Aggregation surface: per-(rung, fold, seed) fitting on the validation split. 4 LODO folds × 3 seeds × 2 policies = 24 thresholds per trained rung; across 4 trained rungs this equals 96 thresholds (48 detection-verification threshold pairs). Selection variance is propagated via eval_toolkit.paired_bootstrap_op_point_diff (two-level bootstrap: refit threshold on each val resample, apply on test resample, compute paired diff) consistent with ADR-022’s per-(seed) threshold protocol. (3) Reporting layout: one new headline column per trained rung, “FPR @ recall ≥ 99%”, for the verification policy; the detection-policy framing is captured by a footnote on the existing recall@FPR=1% column rather than by column duplication; the full dual-policy operating-point grid (4 trained rungs × 2 policies × {pooled-IID + pooled-OOD + 4 per-LODO-fold + 5 per-OOD-slice} aggregation levels; the listed levels total 11, giving 44 cells per policy and 88 overall) lives in WRITEUP/threshold-policy.md spoke alongside the ≥3 deployment scenarios already mandated by ADR-006. (4) Infeasibility handling: honest reporting plus asterisk flag plus audit JSON. 
When TargetRecallSelector(0.99) cannot satisfy the recall ≥ 99% constraint on a (rung, fold, seed) val slice, the cell rendering carries an asterisk and the audit emits per-(rung, fold, seed) reachability evidence to evals/audit/verification_reachability.json with target_reachable plus achieved_val_recall plus fallback_threshold plus fallback_test_fpr fields. The methodology spoke gains a “Verification-target reachability across trained rungs” subsection. (5) Persistence pre-commit for post-hoc recall-floor sweeps — ADR-013 per-row val plus test prediction persistence is sufficient for re-fitting at alternative recall floors {95 percent, 99 percent, 99.9 percent} without retraining; pre-bootstrap CIs at the alternative floor regenerate via the existing paired_bootstrap_op_point_diff orchestrator-layer joblib pipeline per ADR-022. The “Recall-floor sensitivity sweep” is a one-commit afterword in WRITEUP/threshold-policy.md if a Phase 5 reviewer requests it. (6) Reference scorers excluded — dual-policy fitting applies only to the 4 trained rungs per SPEC §4 dual-policy applicability lock; the 4 reference rungs report recall@FPR pinpoints only with explicit contamination caveats per ADR-018 plus ADR-006. Cost-weighted thresholding remains rejected per ADR-006 — no CostSensitiveSelector use; the dual-policy framing is two anchor budgets along the ROC curve, not a Bayes-optimal cost-derivation.
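
To make the two anchor budgets concrete, the sketch below picks a detection threshold (validation FPR at or below 1%) and a verification threshold (validation recall at or above 99%) on toy scores, predicting positive when score >= threshold. These helpers are hypothetical stand-ins written from the definitions; they are not the eval_toolkit selectors and do not model its reachability or fallback behavior.

```python
def threshold_for_target_fpr(benign_scores: list[float],
                             target_fpr: float) -> float:
    """Smallest threshold whose validation FPR is at or below the target."""
    n = len(benign_scores)
    for t in sorted(set(benign_scores)):  # FPR falls as t rises
        if sum(s >= t for s in benign_scores) / n <= target_fpr:
            return t
    return float("inf")  # no benign row may be flagged


def threshold_for_target_recall(attack_scores: list[float],
                                target_recall: float) -> float:
    """Largest threshold whose validation recall is at or above the target."""
    n = len(attack_scores)
    for t in sorted(set(attack_scores), reverse=True):  # recall rises as t falls
        if sum(s >= t for s in attack_scores) / n >= target_recall:
            return t
    raise ValueError("attack_scores must be non-empty")


# Toy validation scores: 1000 benign rows and 100 attack rows.
benign = [i / 1000 for i in range(1000)]
attack = [i / 100 for i in range(1, 101)]
print(threshold_for_target_fpr(benign, 0.01))
print(threshold_for_target_recall(attack, 0.99))
```

Both thresholds are fit on validation only; applying them to the test split and bootstrapping the refit is what propagates selection variance into the reported CIs.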

Acceptance criterion

SPEC_GREENFIELD ledger row 347 carries locked-to-symmetric-1pct-with-honest-feasibility-reporting status (see ADR-025); SPEC_SHEET §5.3 Operating points subsection replaces the locked-pending-numeric-targets phrasing with explicit detection-FPR ≤ 1% plus verification-FNR ≤ 1% (equivalently recall ≥ 99 percent) plus selector-primitive-name plus aggregation-surface-name; SPEC_SHEET §5.1 primary descriptive metrics adds “FPR @ recall ≥ 99 percent” as the verification-policy column on trained rungs only with footnote on the existing recall@FPR=1% column tagging it as the detection-policy operating point; WRITEUP/threshold-policy.md spoke gains “Dual-policy operating-point grid” subsection (per-(rung, fold, seed) cells with paired_bootstrap_op_point_diff CIs at all aggregation levels) plus “Verification-target reachability across trained rungs” subsection plus optional “Recall-floor sensitivity sweep” afterword; decisions/library_imports.md eval-toolkit section gains TargetFPRSelector plus TargetRecallSelector plus paired_bootstrap_op_point_diff plus metrics_at_threshold entries; assumptions.md A-009 added (severity medium) documenting the verification-target reachability assumption with reachability JSON as load-bearing audit surface; tests/test_invariants.py contains skip-marked stub test_dual_policy_threshold_pairing asserting (a) per-(rung, fold, seed) selector fitting on val + test-application + paired_bootstrap_op_point_diff CI propagation; (b) only trained rungs participate (reference rungs excluded); (c) detection threshold equals TargetFPRSelector(0.01) and verification threshold equals TargetRecallSelector(0.99); tests/test_invariants.py contains skip-marked stub test_verification_reachability_audit asserting evals/audit/verification_reachability.json schema (per-(rung, fold, seed) entries with target_reachable plus achieved_val_recall plus fallback_threshold plus fallback_test_fpr fields) and that unreachable cells carry the asterisk flag in headline emit.
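The two-level bootstrap named in this criterion (refit on a val resample, apply on a test resample, take the paired difference) can be sketched as follows. This is a hedged illustration only: the real signature of paired_bootstrap_op_point_diff is not given in this ledger, so the function below, its arguments, the choice of recall at a fixed FPR budget as the compared metric, and the percentile interval are all assumptions.

```python
import random


def fit_fpr_threshold(scores, labels, max_fpr=0.01):
    """Lowest threshold whose FPR on (scores, labels) is <= max_fpr."""
    neg = [s for s, y in zip(scores, labels) if y == 0]
    for t in sorted(set(scores)):
        if not neg or sum(s >= t for s in neg) / len(neg) <= max_fpr:
            return t
    return float("inf")


def recall_at(scores, labels, threshold):
    """Recall when predicting positive for score >= threshold."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    return sum(s >= threshold for s in pos) / len(pos) if pos else 0.0


def paired_bootstrap_recall_diff(val_a, val_b, val_y, test_a, test_b, test_y,
                                 n_boot=200, max_fpr=0.01, seed=0):
    """95% percentile interval for recall(A) - recall(B) at a fixed FPR budget.

    Level 1: resample validation rows and refit both thresholds.
    Level 2: resample test rows and apply the refit thresholds.
    "Paired" here means scorers A and B share the same resampled rows.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        vi = [rng.randrange(len(val_y)) for _ in val_y]
        ti = [rng.randrange(len(test_y)) for _ in test_y]
        vy = [val_y[i] for i in vi]
        ty = [test_y[i] for i in ti]
        t_a = fit_fpr_threshold([val_a[i] for i in vi], vy, max_fpr)
        t_b = fit_fpr_threshold([val_b[i] for i in vi], vy, max_fpr)
        diffs.append(recall_at([test_a[i] for i in ti], ty, t_a)
                     - recall_at([test_b[i] for i in ti], ty, t_b))
    diffs.sort()
    lo = diffs[int(0.025 * (n_boot - 1))]
    hi = diffs[int(0.975 * (n_boot - 1))]
    return lo, hi
```

Refitting inside the loop is what carries threshold-selection variance into the interval; fitting once on the full validation split and resampling only the test rows would understate it.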

CLAIM-026 - ADR-026: Module layout — concern-grouped sub-packages under src/

Status: Accepted

Source: SPEC_GREENFIELD.md §5 Code architecture ledger row 348 + Phase 0-06 walk Q1

Closing commit/ADR: fa1ad33

Claim

Phase 0-06 locks the §5 Code architecture ledger row 348 (Module layout) at concern-grouped sub-packages under src/. The repo’s modelling code is organized as src/{data, training, scoring, eval, utils}/ with sub-package contents per concern — src/data/ holds source loaders + dedup + LODO splits + manifest validation; src/training/ holds ModernBERT loader + LoRA configurator + trainer; src/scoring/ holds reference-scorer adapters (ProtectAI v1/v2, Lakera-API, LLM-judges); src/eval/ holds calibration_battery + operating_points + slice_analysis; src/utils/ holds config_hash + paths + logging glue. CLI entrypoints live in scripts/ (orchestrate the above; one entrypoint per top-level operation — fit_dual_policy_thresholds.py, run_metrics_battery.py, run_bootstrap_battery.py, cost_rollup.py, regenerate_audit.py, check_no_emoji.py). Configs live in configs/{runpod, rungs, data}/ as versioned YAML. Tests live in tests/{conftest.py, test_invariants.py, unit/, smoke/, integration/} with marker-based slicing (per ADR-029). This layout was already implied by file paths cited across prior ADRs (library_imports.md references src/training/load_modernbert.py + src/eval/calibration_battery.py + src/eval/operating_points.py + scripts/fit_dual_policy_thresholds.py + scripts/run_bootstrap_battery.py + configs/runpod/headline.yaml); ADR-026 ratifies as the contract that Phase 1+ implementation must follow. Adding or moving a top-level src/ sub-package post-lock requires a superseding ADR.

Acceptance criterion

SPEC_GREENFIELD ledger row 348 carries locked-to-concern-grouped-subpackages-under-src status (see ADR-026); SPEC_SHEET §6 Code architecture gains a “Module layout” subsection with the 5-sub-package taxonomy + scripts/ + configs/ + tests/ surface enumerated; tests/test_invariants.py contains skip-marked stub test_module_layout_taxonomy asserting (1) src/{data, training, scoring, eval, utils} directories exist as Python packages (each contains __init__.py); (2) scripts/ contains only entrypoint files (no library code); (3) configs/{runpod, rungs, data} directories exist with at least one YAML file each at Phase 1 entry; the no-emoji invariant scan globs already operate over src/ scripts/ configs/ tests/ docs/ so the layout lock does not change scan-target enumeration.
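Checks (1) and (3) of the layout stub could look roughly like the sketch below. The helper names are hypothetical, and the assumption that "is a Python package" means "contains an `__init__.py`" follows the criterion above rather than any existing test code.

```python
from pathlib import Path

EXPECTED_SUBPACKAGES = ("data", "training", "scoring", "eval", "utils")
EXPECTED_CONFIG_DIRS = ("runpod", "rungs", "data")


def missing_subpackages(repo_root):
    """Names under src/ that are absent or lack an __init__.py."""
    src = Path(repo_root) / "src"
    return [name for name in EXPECTED_SUBPACKAGES
            if not (src / name / "__init__.py").is_file()]


def config_dirs_without_yaml(repo_root):
    """Names under configs/ that are absent or hold no YAML file."""
    configs = Path(repo_root) / "configs"
    missing = []
    for name in EXPECTED_CONFIG_DIRS:
        directory = configs / name
        has_yaml = directory.is_dir() and any(
            p.suffix in {".yaml", ".yml"} for p in directory.iterdir()
        )
        if not has_yaml:
            missing.append(name)
    return missing
```

An invariant test would then assert that both helpers return an empty list when called on the repository root.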

CLAIM-027 - ADR-027: Smoke vs canonical separation — three Makefile targets stratified by execution context

Status: Accepted

Source: SPEC_GREENFIELD.md §5 Code architecture ledger row 349 + Phase 0-06 walk Q2

Closing commit/ADR: fa1ad33

Claim

Phase 0-06 locks the §5 Code architecture ledger row 349 (Smoke vs canonical separation) at three Makefile targets stratified by execution context. (1) make smoke runs pytest -m smoke + a fixture-data end-to-end pass through scripts/run_metrics_battery.py with configs/profiles/fixtures.yaml: laptop only, no GPU, no network, less than 10 minutes total. (2) make test-integration runs pytest -m integration and is GPU-aware (uses CUDA via torch.cuda.is_available() if present, skips gracefully via pytest.importorskip and pytest.mark.skipif if not); the same target serves two execution contexts: local-GPU developer-workstation debugging AND cloud-pod pre-flight smoke check before paying for the canonical run. (3) make headline-cloud wraps runpod-deploy validate --all then runpod-deploy run --dry-run then runpod-deploy run --config configs/runpod/headline.yaml: the RunPod-billed canonical evaluation deliverable, cost-cap-gated per ADR-020. The first two are tests (verification of glue + orchestration); the third is the actual evaluation deliverable, NOT a test. Math-rigor production-grade testing (Hypothesis property tests, golden-output snapshots) lives upstream in eval-toolkit (where the math implementations live and where the foundational-library-rigor 90% coverage floor applies); this repo’s test layer is debugging-grade by design: sufficient to catch breakage before paying for cloud time, not sufficient to substitute for upstream library validation. This honest framing is documented in WRITEUP/methodology.md so reviewers do not interpret debugging-grade local tests as production-grade methodology validation. Smoke fixture data lives at tests/fixtures/ (not data/) to keep fixture-vs-real separation visible. A separate make headline-dry-run target exposes runpod-deploy run --dry-run standalone for cost preview without provisioning. Cost-cap discipline per A-002 (per-job soft cap $125; project-wide hard cap $200) gates the headline-cloud target via pre-flight validate-all check.
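The cost-cap gate can be pictured as a small pre-flight decision function. The sketch below is illustrative only: the function name, the three return labels, and the reading of "soft cap" as "needs an explicit override" are assumptions, not behavior documented for runpod-deploy or A-002.

```python
PER_JOB_SOFT_CAP_USD = 125.0   # per-job soft cap cited from A-002
PROJECT_HARD_CAP_USD = 200.0   # project-wide hard cap cited from A-002


def preflight_cost_gate(estimated_job_usd, spent_so_far_usd):
    """Classify a planned job as 'ok', 'needs-override', or 'blocked'.

    Assumed semantics: the hard cap can never be exceeded by cumulative
    spend; the soft cap only demands an explicit human override.
    """
    if estimated_job_usd < 0 or spent_so_far_usd < 0:
        raise ValueError("costs must be non-negative")
    if spent_so_far_usd + estimated_job_usd > PROJECT_HARD_CAP_USD:
        return "blocked"
    if estimated_job_usd > PER_JOB_SOFT_CAP_USD:
        return "needs-override"
    return "ok"


if __name__ == "__main__":
    print(preflight_cost_gate(100.0, 0.0))    # within both caps
    print(preflight_cost_gate(130.0, 0.0))    # over the soft cap only
    print(preflight_cost_gate(100.0, 150.0))  # would exceed the hard cap
```

Checking the hard cap first means a job that breaches both caps is reported as blocked rather than as overridable.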

Acceptance criterion

SPEC_GREENFIELD ledger row 349 carries locked-to-three-makefile-targets-stratified-by-execution-context status (see ADR-027); SPEC_SHEET §6 Code architecture gains a “Smoke vs canonical separation” subsection enumerating the three targets with their execution-context bindings + the honest debugging-grade framing; SPEC_SHEET §6 acceptance criteria mention reproducibility via documented commands gets explicit pointer to make smoke (fast verification) + make headline-cloud (canonical reproduction); Makefile gains placeholder targets for headline-dry-run and headline-cloud (placeholder = echo + exit 0 until configs/runpod/headline.yaml lands at Phase 1; the existing test-smoke and test-integration targets are ratified as locked); STYLE.md gains a “Test rigor scope” subsection documenting the debugging-grade-here-rigorous-upstream split (per Q2 framing); WRITEUP/methodology.md (Phase 5 deliverable) is required to contain the same debugging-grade-vs-upstream framing paragraph so reviewers cannot interpret local test coverage as production-grade methodology validation; tests/test_invariants.py contains skip-marked stub test_smoke_target_completes_under_ten_minutes asserting that make smoke completes in under 10 minutes wall-clock on laptop without GPU + does not require network access (verified via subprocess timeout and network-disabled environment) and skip-marked stub test_integration_gpu_aware_skip asserting that pytest -m integration tests use the pytest.importorskip("torch") and pytest.mark.skipif(not torch.cuda.is_available()) idiom rather than failing on no-GPU laptops.

CLAIM-028 - ADR-028: Test coverage floor — 70% flat with co-locked upstream-issue-filing discipline

Status: Accepted

Source: SPEC_GREENFIELD.md §STYLE ledger row 350 + STYLE.md “project deltas” + Phase 0-06 walk Q3

Closing commit/ADR: fa1ad33

Claim

Phase 0-06 locks the §STYLE ledger row 350 (Test coverage floor) at 70% flat across the repo with a co-locked upstream-issue-filing discipline. The CI command is uv run pytest --cov --cov-fail-under=70 --cov-report=term-missing: a single threshold across the codebase (no src/-vs-scripts/ stratification). The Makefile coverage target is updated from the current ungated form (pytest --cov=. --cov-report=term-missing) to the gated form. The 70% threshold reflects the case-study composition layer’s prototype-grade framing (per ADR-027 debugging-grade-here-rigorous-upstream split): high enough to catch a whole-module-untested regression, low enough to avoid forcing low-value tests against orchestration glue. The co-locked process commitment extends ADR-006 + decisions/upstream_issues.md from library-primitive-gaps to test-coverage-gaps: when a local coverage gap is identified that would be better addressed by an upstream library test (e.g., a test pattern that should live in eval-toolkit’s harness coverage, or a runpod-deploy preflight scenario), file an issue at the upstream repo with the proposed test pattern + rationale. Log the upstream issue in decisions/upstream_issues.md with a “test-coverage-gap” tag in the row. This prevents the 70% floor from forcing low-value local tests when the right home for the test logic is upstream. If a gap genuinely cannot be filed upstream (project-specific glue), the gap is either tested locally OR explicitly documented as deferred via comment + upstream_issues.md “not-applicable” entry pointing at the deferral rationale. STYLE.md project-deltas section is updated from the prior “[OPEN — coverage floor; resolved at Phase 0]” placeholder to the locked 70%-flat-with-upstream-filing-discipline. Limitation: 70% is empirically chosen, not derived; if Phase 1 reveals chronic failure on legitimate orchestration glue with no viable upstream home, reopen via superseding ADR with the actual data. 
Extension condition — production-deployment scope extension lifts floor to 85% with src/eval at 90%; currently out-of-scope.

Acceptance criterion

SPEC_GREENFIELD ledger row 350 carries locked-to-70pct-flat-with-upstream-issue-filing-discipline status (see ADR-028); STYLE.md project-deltas section first bullet is rewritten from the prior “[OPEN — coverage floor; resolved at Phase 0]” placeholder to the locked 70% with the upstream-filing-discipline pointer; Makefile coverage target is updated from the current ungated form to “uv run pytest --cov --cov-fail-under=70 --cov-report=term-missing”; decisions/upstream_issues.md ledger gains a “test-coverage-gap” tag convention documented in the “How to use this ledger” section + a worked example row in the table (or a placeholder row marked TBD-at-Phase-1-entry); tests/test_invariants.py contains skip-marked stub test_coverage_floor_enforced asserting that the Makefile coverage target invokes pytest with --cov-fail-under=70 (or the equivalent CI invocation does so); verification is via subprocess + assert that exit code is non-zero when synthetic coverage drops below 70%; SUBMISSION_AUDIT.md regenerates from the new ADR.
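A static version of the Makefile half of this check might look like the sketch below. It only inspects text; it does not run pytest or measure coverage. The helper name is hypothetical, and the assumption that the rule is literally named `coverage` with tab-indented recipe lines follows the wording above.

```python
import re


def coverage_floor_from_makefile(makefile_text, target="coverage"):
    """Return the --cov-fail-under value in the recipe of `target`, or None."""
    in_target = False
    for line in makefile_text.splitlines():
        if re.match(rf"^{re.escape(target)}\s*:", line):
            in_target = True          # rule header found; recipe lines follow
            continue
        if in_target:
            if line.startswith("\t"):  # Makefile recipe lines are tab-indented
                match = re.search(r"--cov-fail-under=(\d+)", line)
                if match:
                    return int(match.group(1))
            elif line.strip():         # a non-recipe line ends the rule
                in_target = False
    return None


if __name__ == "__main__":
    gated = "coverage:\n\tuv run pytest --cov --cov-fail-under=70 --cov-report=term-missing\n"
    ungated = "coverage:\n\tpytest --cov=. --cov-report=term-missing\n"
    print(coverage_floor_from_makefile(gated))
    print(coverage_floor_from_makefile(ungated))
```

The stub described above goes further by running the gated command against synthetic low coverage and asserting a non-zero exit code; the text check is a cheaper first line of defense.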

CLAIM-029 - ADR-029: Test marker strategy — ratify 4-marker stratification (unit / smoke / integration / network)

Status: Accepted

Source: SPEC_GREENFIELD.md §STYLE ledger row 351 + STYLE.md “project deltas” + pyproject.toml [tool.pytest.ini_options] + tests/conftest.py + Phase 0-06 walk Q4

Closing commit/ADR: fa1ad33

Claim

Phase 0-06 locks the §STYLE ledger row 351 (Test marker strategy) by ratifying the existing 4-marker stratification (unit, smoke, integration, network) already declared in pyproject.toml [tool.pytest.ini_options] markers list, mirrored in tests/conftest.py via pytest_configure addinivalue_line calls, and documented in STYLE.md project-deltas section. --strict-markers is enabled in pyproject addopts so unknown markers fail loudly. Marker semantics: unit (fast, deterministic, no IO, no GPU, no network, less than 1 second per test typical); smoke (end-to-end fixture-data pass through real code paths, less than 10 minutes total, no GPU, no network); integration (exercises real external dependencies such as GPU, HF Hub, RunPod; may skip via pytest.importorskip or pytest.mark.skipif on laptops without GPU per ADR-027 dual-execution-context pattern); network (strictly requires network access such as HF Hub fetch or runpod-deploy GraphQL; skipped in offline CI). property and golden markers are explicitly NOT added: they belong upstream in eval-toolkit per ADR-027 debugging-grade-here-rigorous-upstream split. slow marker explicitly NOT added: smoke already plays this role for end-to-end tests; if a unit test ever creeps over 30 seconds, that is a code-smell the marker should not paper over. gpu sub-marker explicitly NOT added: pytest.importorskip("torch") plus pytest.mark.skipif(not torch.cuda.is_available()) is the standard idiom and handles the local-vs-cloud dual-execution case cleanly without taxonomy proliferation. Limitation: the 4-marker strata do not capture every cross-cutting concern; the discipline relies on conditional-skip idioms within a marker. Extension conditions: add gpu sub-marker if conditional skipif boilerplate exceeds approximately 5 tests; add slow if any unit test crosses 30 seconds; add property only if scope extends to writing project-specific math primitives (currently out-of-scope; math lives upstream). 
Adding or removing a marker post-lock requires a superseding ADR.

Acceptance criterion

SPEC_GREENFIELD ledger row 351 carries locked-to-four-marker-ratification status (see ADR-029); STYLE.md project-deltas section “Test markers” bullet is preserved as-locked (no edit needed; already correct); pyproject.toml [tool.pytest.ini_options] markers list and tests/conftest.py pytest_configure addinivalue_line calls remain in sync (already in sync at ADR-029 entry); tests/test_invariants.py contains skip-marked stub test_pytest_markers_registered_and_in_sync asserting that (1) pyproject.toml [tool.pytest.ini_options] declares exactly the 4 markers (unit, smoke, integration, network) via a set equality check; (2) tests/conftest.py pytest_configure registers exactly the same 4 markers via addinivalue_line; (3) --strict-markers is enabled in addopts; (4) no test file uses an unregistered marker (verified via grep of @pytest.mark.* in tests/ + comparison against the registered set); test_marker_semantics_unit_no_io stub asserting that pytest -m unit tests do not perform network requests or GPU calls, verified via socket-blocking + cuda.is_available-mocked pytest run; SUBMISSION_AUDIT.md regenerates from the new ADR.

CLAIM-030 - ADR-030: Deliverable format — repo-only with Quarto-rendered HTML site via GitHub Actions (supersedes ADR-002)

Status: Accepted

Source: SPEC_GREENFIELD.md §Submission ledger row 347 + ADR-002 supersession trigger + Phase 0-07 walk Q1

Closing commit/ADR: 7979dc9

Claim

Phase 0-07 supersedes ADR-002 (PDF + repo dual-artifact). The deliverable is a public GitHub repository only — no PDF. The repo’s writeup surface is rendered as a static HTML site by Quarto from the existing .md files (WRITEUP.md hub plus WRITEUP/ spokes plus decisions/ ADRs plus a new index.qmd entry-point). The site auto-publishes to GitHub Pages on push to main and on tag push via a .github/workflows/publish.yml workflow using quarto-actions/setup@v2 plus quarto-actions/publish@v2. Output target is HTML only — no PDF auxiliary — to keep the build surface minimal and honor the user’s pivot away from PDF. Quarto rendering preserves the hub-and-spoke structure locked by ADR-004 (now superseded by ADR-031) while replacing the PDF hub artefact with the Quarto site’s index.qmd plus sidebar navigation. Source format stays .md — Quarto natively renders .md files — so no migration to .qmd is required. Reviewer email at submission carries three URLs — repo tree at submission tag for source pin; live GH Pages site for rendered reading; GH release page for CHANGELOG and asset bundle — plus transcripts as private email attachment per existing convention. The PDF-build pipeline scoped for Phase 5 in ADR-002 is removed; pandoc plus LaTeX dependencies drop out of pyproject.toml plus Makefile. Limitation — Quarto binary dependency adds a build-tool install to the contributor surface; first-time GitHub Pages plus GH Actions setup carries deadline risk that the v0.9.0-rc1 rehearsal tag from ADR-033 explicitly mitigates. Extension condition — if a reviewer requests a PDF post-submission, Quarto can produce one from the same source via format augment in frontmatter; addition requires superseding ADR per the SDD discipline.

Acceptance criterion

ADR-002 frontmatter status changes from Accepted to Superseded with superseded_by 030; SPEC_GREENFIELD ledger row 300 (Deliverable format) carries a supersession note pointing at ADR-030; SPEC_GREENFIELD ledger row 347 (PDF bundle composition) status changes from open to locked-to-quarto-html-via-gh-actions (see ADR-030, ADR-031); _quarto.yml exists at repo root declaring project type website with output-dir _site and a navbar plus sidebar referencing WRITEUP.md plus the spoke list plus decisions/; .github/workflows/publish.yml exists declaring on push branches main plus tags v* triggers with quarto-actions/setup@v2 plus quarto-actions/publish@v2 target gh-pages and permissions block (contents write plus pages write plus id-token write); GH Pages enablement is configured (Settings then Pages then Source then gh-pages branch); tests/test_invariants.py contains skip-marked stub test_quarto_site_config_present asserting _quarto.yml exists and parses as valid YAML and declares project type website and that .github/workflows/publish.yml exists and references the two quarto-actions steps; SUBMISSION_AUDIT.md regenerates from the new ADR.

CLAIM-031 - ADR-031: Reviewer reading paths — index.qmd entry-point with Quarto sidebar nav (supersedes ADR-004)

Status: Accepted

Source: SPEC_GREENFIELD.md §Brief ledger row 302 (reviewer profile) supersession trigger + ADR-004 supersession trigger + Phase 0-07 walk Q1

Closing commit/ADR: 7979dc9

Claim

Phase 0-07 supersedes ADR-004 (PDF-as-hub framing) while preserving its hub-and-spoke structure plus A1+A2 dual-audience reading paths plus B4 open-ended-layered reading-time stance. The hub artefact shifts from PDF to the Quarto-rendered HTML site introduced by ADR-030 — specifically the new index.qmd entry-point file at repo root plus the sidebar nav declared in _quarto.yml. index.qmd plays the role of the PDF cover sheet — it carries reviewer reading-path guidance (A1 skim ~15 minutes; A2 audit ~60 minutes; deep-dive reproduce-numbers path) plus a TOC link list into WRITEUP.md plus WRITEUP/ spokes plus decisions/ ADRs plus configs plus results. Sidebar nav in _quarto.yml provides global navigation across every page so a reviewer never needs to backtrack to index.qmd to jump topics. The spoke list finalized at this lock — eight spokes — WRITEUP/eval-design.md plus WRITEUP/methodology-guarantees.md plus WRITEUP/limitations-and-future-work.md plus WRITEUP/data-decisions.md plus WRITEUP/model-rungs.md plus WRITEUP/threshold-policy.md plus WRITEUP/reference-scorer-audit.md plus WRITEUP/reproducibility.md (new — required by ADR-034 to document the T0+T1+T3 tier ladder). Every spoke is standalone-readable plus closes with a limitations-plus-when-to-extend subsection per ADR-005. Cross-link discipline — relative markdown paths between spokes resolve under Quarto rendering; permalinks from outside the site use stable submission tag per ADR-033. The ADR-004 PDF cover-to-cover constraint is replaced by a single-screen-fit constraint on index.qmd plus the Quarto sidebar must surface all spokes (no link-only depth files hidden from the sidebar). Limitation — Quarto sidebar nav requires reviewers to be in-browser plus comfortable with sidebar UI; the offline-reading degradation noted in ADR-030 applies here too. 
Extension condition — if a new methodology spoke surfaces during Phase 1+ (e.g., a contamination-scan deep dive that is too long for limitations-and-future-work.md), add the spoke to WRITEUP/ plus update _quarto.yml sidebar plus add an index.qmd link entry; no superseding ADR needed for spoke addition (the spoke list is provisional); spoke removal does require a superseding ADR (audit-trail discipline).

Acceptance criterion

ADR-004 frontmatter status changes from Accepted to Superseded with superseded_by 031; SPEC_GREENFIELD ledger row 302 (Reviewer profile) carries a supersession note pointing at ADR-031; index.qmd exists at repo root with three reading-path sections (A1 quick-skim plus A2 audit plus deep-dive reproduce-numbers) plus a TOC link list into WRITEUP.md plus all 8 spokes plus decisions/ plus configs/ plus results/; _quarto.yml sidebar block surfaces every spoke under a Methodology section plus every ADR under a Decisions section (Quarto sidebar auto-include glob applied to the decisions/ADR-*.md pattern); WRITEUP/reproducibility.md exists as a placeholder spoke skeleton with at minimum a title plus a tier-ladder table stub (full content populated at Phase 5 per ADR-034); tests/test_invariants.py contains skip-marked stub test_index_qmd_reading_paths_present asserting index.qmd exists and contains the three reading-path section headers; SUBMISSION_AUDIT.md regenerates from the new ADR.

CLAIM-032 - ADR-032: HF Hub checkpoint publication — publish primary headline rungs only with model card discipline

Status: Accepted

Source: SPEC_GREENFIELD.md §Submission ledger row 348 + Phase 0-07 walk Q2

Closing commit/ADR: 7979dc9

Claim

Phase 0-07 locks SPEC_GREENFIELD ledger row 348 (HF Hub checkpoint publication) by publishing the primary headline rungs only (Option C of the Phase 0-07 Q2 walk) — typically 2-4 trained checkpoints corresponding to the rungs the writeup leads with, NOT every ablation rung. Reference scorers (per ADR-018 — protectai/deberta-v3-base-prompt-injection plus the rest of the reference slate) are NOT republished since they are already public artefacts authored by others. The proposed publication set at lock time — final composition revisitable at Phase 5 once the rung ladder settles — is BBehring/prompt-injection-modernbert-frozen-probe (frozen-probe baseline) plus BBehring/prompt-injection-modernbert-lora (LoRA-best rung per ADR-019) plus conditionally BBehring/prompt-injection-modernbert-fullft (full-FT-best rung if promoted to headline status per ADR-019 final composition) plus conditionally BBehring/prompt-injection-tfidf-lr-classical-floor (per ADR-017 classical floor rung if included in headline narrative). TF-IDF + LR rung publication is conditional because sklearn-pipeline serialization to HF Hub is less standardized than transformers checkpoints; Phase 5 work item — assess whether joblib pickle plus a model-card-only repo is sufficient or whether the rung stays unpublished. Naming convention — BBehring/prompt-injection- (lowercase, kebab-case after the prompt-injection prefix). Model cards conform to HF Hub model-card YAML frontmatter schema — license (apache-2.0 inherited from ModernBERT-base) plus tags (text-classification, prompt-injection, safety) plus datasets (HF dataset IDs at the pinned SHAs per ADR-016) plus model-index.results (per-rung headline metrics from results.json with the pooled-OOD column per ADR-021) plus intended use (research and methodology characterisation; NOT production deployment per ADR-005) plus limitations (link back to WRITEUP/limitations-and-future-work.md) plus citation (repo URL at submission tag plus author plus date). 
Each model card README is generated mechanically from the writeup spokes plus results.json at Phase 5 (not hand-written per checkpoint) via a scripts/generate_model_cards.py orchestrator that takes the published-rung list as input. HF Hub authentication for the publication push uses the standard huggingface_hub token discovery mechanism (env var HF_TOKEN plus ~/.cache/huggingface/token) — secrets management discipline deferred to Phase 0-08. The publication step runs once per rung at Phase 5 close (before the v1.0.0 submission tag per ADR-033) and is gated by the v0.9.0-rc1 rehearsal tag — at least one rung must publish successfully to HF Hub before the rehearsal tag is considered passed. Limitation — published checkpoints can be probed offline for adversarial blind-spot discovery (acceptable for a methodology submission scope per ADR-005; would be unacceptable for a deployed defensive classifier). Extension condition — if the Phase 3 rung ladder produces additional ablation rungs that materially shape the writeup narrative (e.g., a specific rung that demonstrates a critical failure mode), promote that rung into the publication set via Phase 5 ADR amendment without superseding this ADR (the discipline is locked; the exact list is provisional).

Acceptance criterion

SPEC_GREENFIELD ledger row 348 carries locked-to-publish-headline-rungs-only-with-model-card-discipline status (see ADR-032); decisions/library_imports.md notes huggingface_hub as a runtime dependency for the publication step (already noted upstream from ADR-013 and ADR-016 but the publication-side use is new); a placeholder Phase 5 work item is captured in assumptions.md or a Phase 5 checklist file noting that scripts/generate_model_cards.py and the publication push are required before the v1.0.0 submission tag; tests/test_invariants.py contains skip-marked stub test_hf_hub_publication_naming_convention asserting that any published model repo follows the BBehring/prompt-injection- pattern (verified at Phase 5 close via huggingface_hub.HfApi().list_repos with author BBehring filter) plus skip-marked stub test_model_card_schema_complete asserting that each published rung’s model card README YAML frontmatter contains the required keys (license, tags, datasets, model-index, intended use, limitations, citation); SUBMISSION_AUDIT.md regenerates from the new ADR.
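Both stubs reduce to simple predicates once the repo IDs and parsed frontmatter are in hand. The sketch below assumes the convention means "BBehring/prompt-injection-" followed by a lowercase kebab-case suffix, and that the seven required keys appear as top-level frontmatter keys spelled exactly as listed above; neither assumption is confirmed by the ledger, and fetching repo IDs or parsing YAML is left out.

```python
import re

# Assumed reading of the convention: fixed prefix + lowercase kebab-case suffix.
REPO_ID_PATTERN = re.compile(r"^BBehring/prompt-injection-[a-z0-9]+(?:-[a-z0-9]+)*$")

# Key spellings copied from the criterion; the actual YAML key names may differ.
REQUIRED_CARD_KEYS = {"license", "tags", "datasets", "model-index",
                      "intended use", "limitations", "citation"}


def follows_naming_convention(repo_id):
    """True if repo_id matches the assumed naming pattern."""
    return bool(REPO_ID_PATTERN.match(repo_id))


def missing_card_keys(frontmatter):
    """Sorted list of required keys absent from an already-parsed frontmatter dict."""
    return sorted(REQUIRED_CARD_KEYS - set(frontmatter))


if __name__ == "__main__":
    print(follows_naming_convention("BBehring/prompt-injection-modernbert-lora"))
    print(missing_card_keys({"license": "apache-2.0", "tags": ["safety"]}))
```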

CLAIM-033 - ADR-033: GitHub release strategy — rehearsal tag plus SemVer submission tag plus post-submission patches

Status: Accepted

Source: SPEC_GREENFIELD.md §Submission ledger row 349 + Phase 0-07 walk Q3

Closing commit/ADR: 7979dc9

Claim

Phase 0-07 locks SPEC_GREENFIELD ledger row 349 (GitHub release strategy) at a two-tag-canonical plus post-submission-patches policy (Option D plus C-post-submission of the Phase 0-07 Q3 walk). Tag sequence — v0.9.0-rc1 at end of Phase 4 (release candidate; dress-rehearsal tag that fires the full publish pipeline — Quarto site build per ADR-030 plus GH Pages deploy plus HF Hub model card pushes per ADR-032 — so that any first-time-GH-Actions plus HF Hub auth plus model card schema issues surface 24-plus hours before submission day) plus v1.0.0 at submission day (canonical reviewer reference; CHANGELOG.md entry committed; GH release object created via gh release create v1.0.0 plus the title flag plus the notes flag pointing at CHANGELOG plus the generate-notes flag) plus optional v1.0.x SemVer patch tags post-submission for typo plus link plus reviewer-feedback fixes (reviewer URL stays pinned at v1.0.0; live Quarto site reflects latest patch per ADR-030 push trigger). No phase-boundary tags during Phases 1-3 — the existing ADR closing_commit SHA field per ADR already provides ADR-granular pinning at finer grain than phase tags; phase-boundary tagging would add discipline overhead under tight deadline without paying back. Tag version format — vMAJOR.MINOR.PATCH per SemVer 2.0.0 (vs annotated suffix vs calendar-versioning); chosen for gh release UI compatibility plus clean MAJOR-bump path if a post-submission methodology revision lands. GH release assets for v1.0.0 — CHANGELOG.md (default) plus _site.tar.gz (offline-readable rendered Quarto site for reviewers without internet; built from quarto render output dir plus tarred); per-row predictions parquet files NOT attached as release binaries (they stay in results/predictions/ within the repo at their natural location to avoid duplication). 
CHANGELOG.md committed at submission per Keep-a-Changelog 1.1.0 format — entries written in human language not git-shortlog dumps; one entry per tag going forward maintained via gh release create with the notes flag. Reviewer email at submission carries three URLs — source pin at github.com tree v1.0.0 (canonical anchor; never drifts) plus live rendered Quarto site at brandon-behring.github.io (reflects latest publish) plus GH release page at github.com releases tag v1.0.0 (CHANGELOG plus _site.tar.gz download) — plus transcripts as private email attachment per existing convention. Limitation — GH Pages serves a single live URL pinned to whatever the latest deploy was; there is no built-in this-URL-is-frozen-at-v1.0.0 affordance without snapshot tooling; mitigated by the canonical-source-pin URL pointing at tagged source plus reviewers can git checkout v1.0.0 plus quarto preview for a frozen-rendered view. Extension condition — post-submission methodology revision (not patch-grade typo fixes; actual content revisions) bumps to v2.0.0 via superseding ADR with rationale plus reviewer-notification step; patch tags v1.0.x are reserved for non-methodology fixes only.

Acceptance criterion

SPEC_GREENFIELD ledger row 349 carries locked-to-rehearsal-plus-submission-plus-patches status (see ADR-033); CHANGELOG.md exists at repo root in Keep-a-Changelog 1.1.0 format with an Unreleased section (populated continuously) plus a stub for v1.0.0 (populated at submission tag); the v0.9.0-rc1 tag and v1.0.0 tag are created at their respective Phase 4 close and submission day (Phase 5 work items captured in assumptions.md or Phase 5 checklist); the .github/workflows/publish.yml workflow per ADR-030 triggers on tag push v* so both rehearsal and submission tags fire the Quarto-publish plus HF Hub model card pipelines; tests/test_invariants.py contains skip-marked stub test_submission_tag_changelog_present asserting that CHANGELOG.md exists and parses (the keepachangelog Python package can validate) plus contains a v1.0.0 section at submission close plus follows Keep-a-Changelog 1.1.0 section structure (Added plus Changed plus Deprecated plus Removed plus Fixed plus Security); SUBMISSION_AUDIT.md regenerates from the new ADR.
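The tag policy can be made concrete with a small classifier. The regular expression follows the SemVer 2.0.0 core-plus-prerelease grammar with a leading `v`; the role labels and the rule that any pre-release tag counts as a rehearsal are assumptions for illustration, not part of ADR-033.

```python
import re

TAG_PATTERN = re.compile(
    r"^v(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)"
    r"(?:-([0-9A-Za-z-]+(?:\.[0-9A-Za-z-]+)*))?$"
)


def parse_tag(tag):
    """Return (major, minor, patch, prerelease) or None if the tag is malformed."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        return None
    major, minor, patch, prerelease = match.groups()
    return int(major), int(minor), int(patch), prerelease


def classify_tag(tag):
    """Map a tag onto the roles described in the release policy (labels are hypothetical)."""
    parsed = parse_tag(tag)
    if parsed is None:
        return "invalid"
    major, minor, patch, prerelease = parsed
    if prerelease is not None:
        return "rehearsal"      # e.g. v0.9.0-rc1
    if (major, minor, patch) == (1, 0, 0):
        return "submission"     # canonical reviewer reference
    if (major, minor) == (1, 0):
        return "patch"          # v1.0.x non-methodology fixes
    return "other"              # e.g. v2.0.0 needs a superseding ADR


if __name__ == "__main__":
    for tag in ("v0.9.0-rc1", "v1.0.0", "v1.0.3", "v2.0.0", "1.0.0"):
        print(tag, classify_tag(tag))
```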

CLAIM-034 - ADR-034: Reproducibility tier — full ladder T0 (eval-from-hub) plus T1 (smoke) plus T3 (headline-cloud)

Status: Accepted

Source: SPEC_GREENFIELD.md §Submission ledger row 350 + SPEC_GREENFIELD §6 line 249 (kit-level reproducibility claim) + Phase 0-07 walk Q4

Closing commit/ADR: 7979dc9

Claim

Phase 0-07 locks SPEC_GREENFIELD ledger row 350 (Reproducibility tier) at a layered three-tier reviewer-facing reproduction ladder (Option D of the Phase 0-07 Q4 walk) — T0 eval-from-hub (laptop; about 10-30 minutes; about zero dollars; downloads a published HF Hub checkpoint per ADR-032 and runs eval-only against fixture or held-out data; verifies headline scores reproduce without re-training) plus T1 smoke (laptop; about 10 minutes; about zero dollars; runs make smoke per ADR-027; verifies pipeline shape on fixture data; does NOT verify headline numbers — code health only) plus T3 headline-cloud (cloud-GPU; about hours; about 125 dollars per ADR-020; runs make headline-cloud; full re-training from scratch; verifies every step including training). T2 (make test-integration) stays a developer-tool tier — not promoted to reviewer-facing because it requires a local GPU the reviewer may not have and adds friction without enabling a new verification dimension (T0 covers eval; T3 covers full retraining; T2 only adds subset eval on local GPU which is strictly weaker than T3). The kit-level SPEC_GREENFIELD §6 line 249 lock — a stranger can clone install and reproduce headline numbers via documented commands — is operationally satisfied at T0 (cheapest highest-coverage for most reviewers) and stronger at T3 (full retraining). The smoke tier T1 is documented but framed as code-health-only not a math-correctness check. T0 is enabled by ADR-032 HF Hub publication; without published checkpoints T0 collapses and the reproducibility ladder reduces to T1 plus T3 only. 
T0 commands operationalize as make eval-from-hub (Phase 3 work item; about 30-50 lines of glue wrapping huggingface_hub.snapshot_download plus eval-toolkit scoring primitives plus per-row prediction emission) — accepts a rung name argument and downloads the corresponding BBehring/prompt-injection-modernbert- repo plus runs scoring against the eval data plus emits per-row predictions plus prints a score match table comparing against the committed results.json. The ladder maps approximately onto ACM artifact-review-and-badging conventions — T0 plus T1 supply Functional plus Reusable badge levels; T3 supplies the Reproducible badge level. WRITEUP/reproducibility.md spoke (slotted by ADR-031) documents the ladder with verbatim commands plus cost plus time plus what each tier verifies plus what each tier does NOT verify. Limitation — T0 reproduces headline scores only on the published rungs per ADR-032 Option C — ablation-rung reproduction requires T3; T0 verifies score-match against the published checkpoint not data-pipeline correctness — a reviewer who wants to verify the data preparation pipeline produces the same training tensors needs T3. Extension condition — if Phase 5 reveals T0 has correctness gaps (e.g., HF Hub checkpoint can drift due to model card edits invalidating cached SHAs), pin checkpoint SHAs in WRITEUP/reproducibility.md and make eval-from-hub via huggingface_hub.snapshot_download with the revision argument — the HF dataset SHA-pinning discipline from ADR-016 carries over to HF model SHA-pinning here; if a production-grade reproducibility scope extension lifts requirements (e.g., the writeup expands to include a deployment-grade claim), add T4 Docker-container-with-frozen-environment via superseding ADR (currently out-of-scope per ADR-005 plus ADR-027 prototype-grade framing).
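A minimal sketch of the T0 flow described above, assuming a flat metric-name-to-float layout for results.json. The helper names (fetch_checkpoint, score_match_table), the tolerance, and the metric names are hypothetical; only huggingface_hub.snapshot_download and its revision argument come from the text.

```python
from __future__ import annotations


def fetch_checkpoint(repo_id: str, revision: str) -> str:
    """Download one published checkpoint at a pinned commit SHA.

    Hypothetical helper. The import is local so the comparison logic below
    can be exercised without huggingface_hub installed.
    """
    from huggingface_hub import snapshot_download

    # Passing a commit SHA as `revision` pins the exact repo state.
    return snapshot_download(repo_id=repo_id, revision=revision)


def score_match_table(
    reproduced: dict[str, float],
    committed: dict[str, float],
    tol: float = 1e-6,  # assumed tolerance; the ADR does not specify one
) -> list[tuple[str, float, float, bool]]:
    """Compare reproduced metrics against the committed results.json values."""
    rows = []
    for metric in sorted(committed):
        got = reproduced.get(metric, float("nan"))  # a missing metric never matches
        want = committed[metric]
        rows.append((metric, got, want, abs(got - want) <= tol))
    return rows


if __name__ == "__main__":
    committed = {"auroc": 0.91, "f1": 0.80}   # placeholder numbers, not project results
    reproduced = {"auroc": 0.91, "f1": 0.79}
    for metric, got, want, ok in score_match_table(reproduced, committed):
        verdict = "OK" if ok else "MISMATCH"
        print(f"{metric:<8} reproduced={got:.4f} committed={want:.4f} {verdict}")
```

Keeping the download separate from the comparison means the score-match logic can be unit-tested offline, while the network step stays a thin wrapper around the pinned revision.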

Acceptance criterion

SPEC_GREENFIELD ledger row 350 carries locked-to-full-ladder-T0-T1-T3 status (see ADR-034); WRITEUP/reproducibility.md spoke exists with at minimum a title plus a tier-ladder table stub (full content populated at Phase 5 — verbatim commands plus cost plus time plus what-verifies plus what-does-not-verify); Makefile has a make eval-from-hub target placeholder (Phase 3 implementation; can be a stub echo at Phase 0-07 close); tests/test_invariants.py contains skip-marked stub test_reproducibility_tier_documented asserting (1) WRITEUP/reproducibility.md exists; (2) the spoke contains all three tier names (T0 plus T1 plus T3) in section headers; (3) each tier has a verbatim command (make smoke plus make eval-from-hub plus make headline-cloud) in a code block; (4) Makefile contains all three target names (smoke plus eval-from-hub plus headline-cloud) as rules; SUBMISSION_AUDIT.md regenerates from the new ADR.
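The four assertions in test_reproducibility_tier_documented could be sketched as below. This assumes the spoke is Markdown with #-style headers and fenced code blocks; the function name missing_tier_items is hypothetical, and the tier-to-command mapping is taken from the text.

```python
import re

# Tier name -> verbatim command, as listed in the acceptance criterion.
TIERS = {"T0": "make eval-from-hub", "T1": "make smoke", "T3": "make headline-cloud"}
FENCE = "`" * 3  # built at runtime so this example can itself sit inside a fence


def missing_tier_items(spoke_md: str, makefile: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means all checks pass."""
    problems = []
    headers = [line for line in spoke_md.splitlines() if line.startswith("#")]
    # Concatenate the bodies of all fenced code blocks in the spoke.
    blocks = re.findall(rf"{FENCE}[^\n]*\n(.*?){FENCE}", spoke_md, flags=re.DOTALL)
    code = "\n".join(blocks)
    for tier, command in TIERS.items():
        if not any(re.search(rf"\b{tier}\b", header) for header in headers):
            problems.append(f"{tier}: no section header")
        if command not in code:
            problems.append(f"{tier}: '{command}' not in a code block")
        target = command.removeprefix("make ")
        # A rule is a line that starts with the target name followed by a colon.
        if not re.search(rf"^{re.escape(target)}\s*:", makefile, flags=re.MULTILINE):
            problems.append(f"{tier}: Makefile has no '{target}' rule")
    return problems
```

Returning a problem list rather than a bare boolean lets the eventual pytest assertion print every gap at once.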

CLAIM-035 - ADR-035: Secrets management — three-store split aligned with execution context (.env + RunPod pod-secrets + GH Actions repo Secrets)

Status: Accepted

Source: SPEC_GREENFIELD.md §Tech-Stack ledger row 305 + Phase 0-08 walk Q5

Closing commit/ADR: 5427b95

Claim

Phase 0-08 locks SPEC_GREENFIELD ledger row 305 (Secrets management) at a three-store split aligned with the three execution contexts that span the submission lifecycle — local laptop plus RunPod cloud pod plus GitHub Actions runner. Local laptop uses a gitignored .env file at repo root containing real tokens; consumer libraries (huggingface_hub plus openai plus anthropic plus runpod-deploy CLI) discover tokens via their default env-var discovery mechanism (canonical env var names HF_TOKEN plus RUNPOD_API_KEY plus OPENAI_API_KEY plus ANTHROPIC_API_KEY). RunPod cloud pod injects tokens as env vars on pod start via the runpod-deploy pod-secrets primitive declared in configs/runpod/headline.yaml per ADR-020. GitHub Actions runner accesses tokens via repo-level Settings then Secrets and variables then Actions (Web UI configuration); GITHUB_TOKEN is auto-injected by GH Actions runtime per ADR-030 publish workflow; HF_TOKEN added as a repo secret only if model card push runs in CI (per ADR-032 model card generation runs at Phase 5 close — may be local plus may be CI; secret added preemptively). A committed .env.example template at repo root (with placeholder values not real tokens) enumerates the four canonical env vars so reviewers can see the secret surface without running anything; .env.example serves the kit-level audit-friendly framing. The pre-commit gitleaks hook (already enabled and passing for all prior commits in this session) provides defense-in-depth against accidental .env commits; ADR-035 explicitly ratifies this gate as part of the secrets posture. Rotation protocol — token rotation requires updating all three stores in sequence (local .env then RunPod config then GH Actions Secrets); documented in CHANGELOG.md or a docs/secrets.md rotation runbook. 
Pre-flight verification — scripts/preflight_secrets.py (Phase 1 work item) asserts the four env vars are non-empty before any real-cost run (cloud dispatch plus LLM-judge API calls); fails loud per Python standards (ValueError with explicit message naming the missing token plus the consumer that would need it). Out-of-scope at this lock — cloud secret manager migration (Doppler plus Infisical plus 1Password plus AWS Secrets Manager plus GCP Secret Manager) is documented as a future-extension condition triggered by either (a) post-submission scope extension to production-grade deployment per ADR-005 plus ADR-027 framing or (b) Phase 1+ surfaces five-plus additional secrets that make the three-store rotation discipline costly. Encrypted-in-repo (git-crypt plus sops plus age) explicitly rejected — adds key management complexity plus contradicts library defaults (consumer libraries do not read encrypted blobs). Limitation — secrets in three stores means rotation discipline matters; mitigation is the preflight script. Extension condition — production-grade deployment scope extension triggers migration to a cloud secret manager via superseding ADR.
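A minimal sketch of what scripts/preflight_secrets.py might look like, assuming the check is a plain function over an environment mapping. The function name and message wording are illustrative; the four variable names, the consumer names, and the ValueError behavior come from the text.

```python
from __future__ import annotations

import os
from collections.abc import Mapping

# Canonical env var -> consumer that would need it (both taken from the ADR text).
REQUIRED_TOKENS = {
    "HF_TOKEN": "huggingface_hub",
    "RUNPOD_API_KEY": "runpod-deploy CLI",
    "OPENAI_API_KEY": "openai",
    "ANTHROPIC_API_KEY": "anthropic",
}


def preflight_secrets(env: Mapping[str, str] | None = None) -> None:
    """Raise ValueError naming every missing token and the consumer that needs it."""
    env = os.environ if env is None else env
    missing = [
        (name, consumer)
        for name, consumer in REQUIRED_TOKENS.items()
        if not env.get(name, "").strip()  # unset, empty, and whitespace-only all fail
    ]
    if missing:
        detail = "; ".join(f"{name} (needed by {consumer})" for name, consumer in missing)
        raise ValueError(f"Missing or empty secrets: {detail}")


if __name__ == "__main__":
    preflight_secrets()
    print("all four secrets present")
```

Accepting the environment as a parameter keeps the check testable without mutating os.environ.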

Acceptance criterion

SPEC_GREENFIELD ledger row 305 carries locked-to-three-store-split-aligned-with-execution-context status (see ADR-035); .env.example exists at repo root with placeholder values for the four canonical env vars (HF_TOKEN plus RUNPOD_API_KEY plus OPENAI_API_KEY plus ANTHROPIC_API_KEY); .gitignore covers .env plus .env.local plus .env.production patterns (verified via git check-ignore); pre-commit gitleaks hook is enabled in .pre-commit-config.yaml and is part of the standard pre-commit gate (verified by prior commits passing); scripts/preflight_secrets.py is captured as a Phase 1 work item in assumptions.md or a Phase 1 checklist file; tests/test_invariants.py contains skip-marked stub test_env_example_template_present asserting (1) .env.example exists at repo root; (2) the file enumerates all four canonical env vars HF_TOKEN plus RUNPOD_API_KEY plus OPENAI_API_KEY plus ANTHROPIC_API_KEY; (3) the values are placeholder-shaped not real tokens (regex check — no value matches the real-token signature for any of the four token classes); (4) .gitignore covers .env (verified via subprocess git check-ignore); SUBMISSION_AUDIT.md regenerates from the new ADR.

CLAIM-036 - ADR-036: Library version pins — tag pin to latest stable plus freeze for submission window (eval-toolkit v0.31.0 + runpod-deploy v0.7.7 + research_toolkit v1.9.1)

Status: Accepted

Source: SPEC_GREENFIELD.md §Tech-Stack ledger rows 307 + 308 + 309 + Phase 0-08 walk Q2

Closing commit/ADR: 5427b95

Claim

Phase 0-08 locks SPEC_GREENFIELD ledger rows 307 plus 308 plus 309 (the three library version pins) in a single bundled ADR at a “tag pin to latest stable plus freeze for submission window” policy. Specific versions: eval-toolkit at v0.31.0 (latest as of 2026-05-16) plus runpod-deploy at v0.7.7 (pre-locked by ADR-020 plus matching latest) plus research_toolkit at v1.9.1 (latest as of 2026-05-16; same toolkit that produced the existing docs/research/ dossier). The pyproject.toml dependencies stanza becomes the canonical pinning location, using git+https URL plus tag specifier syntax: "eval-toolkit @ git+https://github.com/brandon-behring/eval-toolkit@v0.31.0", plus matching specifiers for the other two libraries. uv.lock provides byte-level reproducibility on top of the tag pin (already committed per kit-level discipline). Pinning strategy: tag pin chosen over SHA pin (readability; pyproject answers what version is in scope at a glance) and over branch pin (auto-track contradicts methodology-submission audit-trail framing). Update policy: freeze for submission window (Phase 0-08 close until v1.0.0 submission tag per ADR-033); no routine bumps during Phase 1+ work; bump-triggers are exactly three: (1) blocking bug discovered upstream that breaks our use-pattern per decisions/library_imports.md, (2) critical security fix in the upstream (a CVE-grade event), or (3) reviewer-feedback-driven post-submission patch per ADR-033 v1.0.x discipline (post-submission only; the submission tag freezes the pin). A routine “the upstream has a new release” event is NOT a bump trigger. Each bump produces a new commit plus optional ADR cross-reference (the bump amends the version pin but does NOT supersede ADR-036; the discipline is locked and the specific version moves). Freeze policy expires at v2.0.0 (per ADR-033 major-bump discipline) without a superseding ADR; major-bump library pins get re-walked.
Per-library requires-python compatibility — each upstream library has its own requires-python constraint; pinning above the strictest is fine; the three libraries are own-authored so Brandon controls each upstream constraint. Phase 0-08 close runs uv sync to verify all three install cleanly under the requires-python >=3.13 lock (per ADR-037); if any library blocks the >=3.13 floor, fix-forward is either (a) bump the upstream library requires-python to >=3.13 in a same-day patch, or (b) loosen ADR-037 pin to >=3.12 via superseding ADR. Limitation — tag pin trusts upstream tag immutability; force-pushed tags (which CLAUDE.md bans for own-authored repos) would invalidate the lock; defense-in-depth is uv.lock plus upstream commit discipline. Extension condition — post-submission iteration lasting longer than one patch cycle (v1.0.x) should re-evaluate the freeze policy; year-long freeze becomes a maintenance liability rather than a stability asset; freeze expires at v2.0.0.

Acceptance criterion

SPEC_GREENFIELD ledger rows 307 plus 308 plus 309 each carry locked-to-tag-pin-plus-freeze-with-specific-version status (see ADR-036); pyproject.toml lines 8-14 contain the three uncommented dependency specifiers in the form library-name at git plus https URL plus tag (eval-toolkit at v0.31.0; runpod-deploy at v0.7.7; research_toolkit at v1.9.1); uv sync runs cleanly under requires-python >=3.13 at Phase 0-08 close (verified manually at sub-session close); uv.lock is updated with the three library versions plus their transitive dependencies; decisions/library_imports.md gains a “Version pinning lock” subsection documenting the freeze policy plus the bump-trigger protocol; tests/test_invariants.py contains skip-marked stub test_pyproject_library_version_pins asserting (1) pyproject.toml dependencies stanza contains the three libraries at the locked versions (regex-grep for the specific tags); (2) uv.lock includes the three libraries (verified via uv.lock parse plus version field check); (3) no library is pinned to main branch or a branch other than a tagged version (regex check excluding the at-tag pattern); SUBMISSION_AUDIT.md regenerates from the new ADR.
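The pin assertions in test_pyproject_library_version_pins could be sketched as a check over the dependency strings. This assumes each pin is a PEP 508 direct reference of the form name @ git+https://URL@ref; the function name pin_problems and the regex are illustrative, and reading the strings out of pyproject.toml (for example with the standard-library tomllib) is left to the caller.

```python
import re

# Locked tags from ADR-036.
LOCKED_TAGS = {
    "eval-toolkit": "v0.31.0",
    "runpod-deploy": "v0.7.7",
    "research_toolkit": "v1.9.1",
}

# Assumed specifier shape: "<name> @ git+https://<url>@<ref>"
DIRECT_REF = re.compile(
    r"^\s*(?P<name>[A-Za-z0-9._-]+)\s*@\s*git\+https://\S+?@(?P<ref>[^\s#;]+)"
)


def pin_problems(dependencies: list[str]) -> list[str]:
    """Report libraries that are missing, unpinned, or pinned to the wrong ref."""
    found = {}
    for spec in dependencies:
        match = DIRECT_REF.match(spec)
        if match:
            found[match["name"]] = match["ref"]
    problems = []
    for name, tag in LOCKED_TAGS.items():
        ref = found.get(name)
        if ref is None:
            problems.append(f"{name}: no git+https tag pin found")
        elif ref != tag:
            # Also catches branch pins such as "main".
            problems.append(f"{name}: pinned to {ref!r}, expected {tag!r}")
    return problems
```

Because a branch name in the ref position simply fails the equality check, the "no branch pin" assertion falls out of the same comparison as the version check.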

CLAIM-037 - ADR-037: Python version pin — ratify requires-python >=3.13 (matches existing .python-version + bc8ce4e commit)

Status: Accepted

Source: SPEC_GREENFIELD.md §Tech-Stack ledger row 310 + Phase 0-08 walk Q1 + pre-existing commit bc8ce4e

Closing commit/ADR: 5427b95

Claim

Phase 0-08 locks SPEC_GREENFIELD ledger row 310 (Python version pin) at requires-python >=3.13, ratifying the existing pre-Phase-0-08 state. The .python-version file at repo root reads 3.13 (committed at bc8ce4e). The pyproject.toml requires-python line reads >=3.13 (committed at bc8ce4e: chore plus pin Python to >=3.13 plus add .python-version per the prior session). The Phase 0-08 Q1 walk surfaced four options: (A) ratify >=3.13; (B) tighten to ==3.13.*; (C) loosen to >=3.12; (D) loosen to >=3.11. Option A chosen: zero-churn ratification with documented rationale. Rationale: Python 3.13 (released October 2024) receives upstream releases through October 2029 under the five-year support window of the PEP 602 release schedule (bugfix releases first, then security-only); uv handles installability transparently (uv sync auto-installs Python 3.13 from the pinned .python-version regardless of reviewer’s system Python); the three load-bearing libraries (eval-toolkit at v0.31.0 plus runpod-deploy at v0.7.7 plus research_toolkit at v1.9.1 per ADR-036) are own-authored so per-library requires-python constraints are under Brandon’s control; the prior bc8ce4e commit selected >=3.13 with rationale baked in; uv.lock provides byte-level reproducibility on top. Phase 0-08 close runs uv sync to verify all three load-bearing libraries install cleanly under requires-python >=3.13; if any library blocks the >=3.13 floor, fix-forward is either (a) bump the upstream library requires-python to >=3.13 in a same-day patch, or (b) loosen this ADR pin to >=3.12 via superseding ADR. Tightening to ==3.13.* (option B) rejected: brittle; refuses 3.14+ which is overly strict; uv.lock already provides byte-level reproducibility so an exact-minor pin adds no value. Loosening to >=3.12 (option C) rejected: contradicts the already-committed .python-version of 3.13; uv handles installability transparently so the wider compatibility option C unlocks is not material. Loosening to >=3.11 (option D) rejected: same as C; bigger gap; loses 3.13 features further.
Limitation — 3.13 is recent (Oct 2024 release); some downstream wheels may not have 3.13 builds yet — but uv defaults to building from source for missing wheels so installability is preserved (slower first install; not blocking). Extension condition — 3.13-only feature dependency emerges (e.g., free-threaded build for CPU-bound bootstrap loop) tightens the pin via superseding ADR; 3.13 wheel-availability issue on RunPod base images loosens to >=3.12 via superseding ADR; currently expected to not be an issue.

Acceptance criterion

SPEC_GREENFIELD ledger row 310 carries locked-to-requires-python-3-13 status (see ADR-037); pyproject.toml line 7 reads requires-python = ">=3.13" (unchanged from prior commit bc8ce4e); .python-version file at repo root reads 3.13 (unchanged from prior commit bc8ce4e); uv sync at Phase 0-08 close succeeds without per-library requires-python conflict (verified manually at sub-session close; if any of the three libraries blocks 3.13, this ADR cannot lock until the conflict is resolved via either upstream patch or superseding pin); tests/test_invariants.py contains skip-marked stub test_python_version_pin_at_3_13 asserting (1) pyproject.toml requires-python equals >=3.13; (2) .python-version contains exactly 3.13; (3) sys.version_info major-minor is at least (3, 13) when running pytest (the test fails if invoked on Python <3.13 even though Phase 1+ implementation defers active enforcement); SUBMISSION_AUDIT.md regenerates from the new ADR.
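The three assertions in test_python_version_pin_at_3_13 could be sketched as follows. The function name is hypothetical, and the regex over the raw pyproject.toml text is a simplification; a fuller version would parse the file as TOML.

```python
import re
import sys


def python_pin_problems(
    pyproject_text: str,
    python_version_text: str,
    running: tuple = tuple(sys.version_info[:2]),
) -> list[str]:
    """Check the requires-python line, the .python-version file, and the interpreter."""
    problems = []
    match = re.search(
        r'^requires-python\s*=\s*"([^"]+)"', pyproject_text, flags=re.MULTILINE
    )
    if match is None or match.group(1).strip() != ">=3.13":
        problems.append("pyproject.toml requires-python is not '>=3.13'")
    if python_version_text.strip() != "3.13":
        problems.append(".python-version is not exactly '3.13'")
    if tuple(running) < (3, 13):
        problems.append(f"interpreter {running[0]}.{running[1]} is below 3.13")
    return problems
```

The running parameter defaults to the current interpreter but can be overridden, so the check itself can be tested from any Python version.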

CLAIM-038 - ADR-038: Phase tailoring — light ROADMAP edits to Phase 4 close + Phase 5 description (preserves 5-phase structure)

Status: Accepted

Source: SPEC_GREENFIELD.md Roadmap ledger row 313 + docs/ROADMAP.md line 83 decision-needed prompt + Phase 0-08 walk Q6

Closing commit/ADR: 5427b95

Claim

Phase 0-08 locks SPEC_GREENFIELD ledger row 313 (Phase tailoring) at light tailoring of docs/ROADMAP.md Phase 4 close plus Phase 5 description while preserving the kit-level 5-phase structure (Phase 1 Data plus Phase 2 Training plus Phase 3 Evaluation plus Phase 4 Analysis plus Phase 5 Writeup). Phase 0-07 expanded the Phase 5 surface (Quarto site per ADR-030 plus 8 spokes per ADR-031 plus HF Hub model card publication per ADR-032 plus rehearsal-tag-fires-publish-pipeline per ADR-033 plus T0+T1+T3 tier-ladder spoke per ADR-034) without altering the phase structure; SPEC_SHEET §2 Phase 5 gate checklist was updated at Phase 0-07 close to reflect these additions but docs/ROADMAP.md was left at kit-level pre-tailoring text. ADR-038 closes the drift by tailoring ROADMAP.md to match. Two surface-area edits — (1) Phase 4 close gains a 2-line note that before exiting Phase 4 the v0.9.0-rc1 rehearsal tag fires per ADR-033 triggering the full publish pipeline (Quarto site build per ADR-030 plus GH Pages deploy plus HF Hub model card pushes per ADR-032) as a 24-plus hour dress-rehearsal with fix-forward via new commits plus v0.9.0-rc2 if rehearsal fails; (2) Phase 5 description rewritten to replace deliverable bundle assembled with Quarto site published to GH Pages via the .github/workflows/publish.yml workflow per ADR-030 plus 8 spokes populated plus index.qmd reading-paths guide complete plus HF Hub model repos for headline rungs published per ADR-032 plus WRITEUP/reproducibility.md documents T0+T1+T3 tier ladder per ADR-034 plus Phase 5 close fires v1.0.0 submission tag per ADR-033 with GH release CHANGELOG plus _site.tar.gz asset. Phases 1-3 description text preserved unchanged — Phase 0-07 additions don’t touch Phases 1-3. 
The decision-needed prompt at ROADMAP line 83 (project-specific tailoring of the phase structure — e.g., add a Phase 2b for a smoke-train preflight; collapse Phase 3+4 if analysis is light) is answered with no structural restructure — 5-phase frame preserved; the rehearsal is a tag (not a phase) per ADR-033; Phase 4 plus Phase 5 are the right granularity; Phase 2b smoke-train preflight is unnecessary since make smoke per ADR-027 already covers laptop-only fixture-data preflight without a phase split; Phase 3+4 collapse is rejected since Phase 4 carries first-class statistical-inference work (paired-bootstrap plus cv_clt_ci plus MDE plus reference-scorer audit per ADR-022 plus ADR-024) that deserves the same phase-gate discipline as Phase 3 metric computation. Limitation — tailoring forks ROADMAP from the kit-level template; anyone running a future project from the same kit must not copy this project’s ROADMAP back into the kit. Extension condition — Phase 1+ surprise that warrants ROADMAP-level tailoring (e.g., a major training-pipeline pivot that re-orders Phase 2 plus Phase 3 gates) updates ROADMAP via superseding ADR-038 with the new tailoring; reviewer feedback signals splitting Phase 5 into 5a writeup plus 5b publication plus 5c submission gets restructured via superseding ADR — currently below the friction threshold.

Acceptance criterion

SPEC_GREENFIELD ledger row 313 carries locked-to-light-roadmap-edits-phase4-plus-phase5 status (see ADR-038); docs/ROADMAP.md Phase 4 close section gains a paragraph naming the v0.9.0-rc1 rehearsal tag plus the full publish pipeline cite (Quarto site build per ADR-030 plus GH Pages deploy plus HF Hub model card pushes per ADR-032); docs/ROADMAP.md Phase 5 description is rewritten to replace the kit-default deliverable bundle assembled line with the Phase 0-07 additions (Quarto site published plus 8 spokes plus index.qmd plus HF Hub publish plus reproducibility spoke plus v1.0.0 submission tag); docs/ROADMAP.md Phases 1-3 text preserved unchanged (verified via diff); the decision-needed prompt at line 83 is replaced with a brief note pointing at ADR-038 (no structural restructure; rehearsal is a tag not a phase); tests/test_invariants.py contains skip-marked stub test_roadmap_phase4_phase5_tailored asserting (1) docs/ROADMAP.md Phase 4 section contains the v0.9.0-rc1 rehearsal tag name; (2) docs/ROADMAP.md Phase 5 section contains all five ADR citations (ADR-030 plus ADR-031 plus ADR-032 plus ADR-033 plus ADR-034); (3) docs/ROADMAP.md still declares exactly 5 phases past Phase 0 (no Phase 4.5 or 5a/5b splits); SUBMISSION_AUDIT.md regenerates from the new ADR.

CLAIM-039 - ADR-039: Project-specific acceptance criteria — 6-gate integration checklist for v1.0.0 submission tag

Status: Accepted

Source: SPEC_GREENFIELD.md §6 Verify ledger row 351 + Phase 0-08 walk Q7 + SPEC_GREENFIELD §6 line 252 decision-needed prompt

Closing commit/ADR: 5427b95

Claim

Phase 0-08 locks SPEC_GREENFIELD ledger row 351 (Project-specific acceptance criteria) at a 6-gate integration checklist that aggregates across the per-ADR acceptance_criterion fields plus the kit-level §6 verification gates plus the SPEC_SHEET §2 Phase 5 gate checklist. The 6 gates: (1) zero [OPEN] in SPEC_SHEET (every slot reads [LOCKED — value (per ADR-NNN)] or [TBD-at-Phase-N] with explicit rationale); (2) zero open rows in SPEC_GREENFIELD ledger appendix (every row reads locked-to-X (see ADR-NNN) or superseded-by-NNN or deferred-to-phase-N with explicit rationale); (3) all 39-plus tests/test_invariants.py stubs unskipped plus green at submission tag (every @pytest.mark.skip decorator removed; pytest -m unit exits clean); (4) SUBMISSION_AUDIT.md regenerates cleanly with every claim in Accepted or Superseded state (no Proposed claims at submission tag; verified by make audit, which wraps scripts/regenerate_audit.py --check); (5) v0.9.0-rc1 rehearsal tag fired successfully before v1.0.0 submission tag per ADR-033 (verified by git tag -l v0.9.0-rc1 showing the tag exists plus the corresponding GH Actions workflow run shows green status); (6) all three reviewer URLs at v1.0.0 resolve: source pin at GitHub tree v1.0.0 plus live Quarto site at GitHub Pages URL plus GH release page with CHANGELOG plus _site.tar.gz asset (per ADR-033). Per-ADR acceptance_criterion fields collectively cover the granular gates (data manifests plus calibration artefacts plus threshold reachability plus HF Hub model card schema, among others); those stay in the ADRs as source of truth; ADR-039 references them rather than restating. Kit-default §6 gates preserved: make test passes (incl. invariants); make lint clean; evals/results.json schema-validated against eval-toolkit results.v1.json schema; all severity-≥-medium assumptions in assumptions.md appear in WRITEUP caveats block.
The submission-readiness sign-off lives at SPEC_SHEET §7 (existing kit-default section) plus expanded to include the 6 integration gates plus a per-ADR-criteria reference plus a per-Phase-5-gate reference. Limitation — integration-level gates assume per-ADR acceptance_criterion fields are well-formed; an ADR with vague acceptance criteria can pass aggregation while leaving real gaps; mitigation is the reviewer-protocol at ADR lock time (each new ADR’s acceptance criterion is reviewed for verifiability — already standard practice across the 39 ADRs). Extension condition — Phase 1+ surprise reveals an integration-level gate missing from the 6 (e.g., verify the cost ledger CSV passes schema validation) adds a 7th-or-8th gate via Phase 1+ ADR amendment without superseding ADR-039 (the framing is locked; the specific gates are extensible); methodology revision (post-submission v2.0.0) materially changes the gate set supersedes ADR-039 with a new gates ADR.

Acceptance criterion

SPEC_GREENFIELD ledger row 351 carries locked-to-six-gate-integration-checklist status (see ADR-039); SPEC_SHEET §7 Verification and acceptance criteria gains the 6-gate integration checklist with the per-ADR-criteria pointer plus the kit-default §6 gates preserved; SUBMISSION_TEMPLATE.md or a SUBMISSION.md sign-off section quotes the 6 gates so the submission-readiness check is reviewer-readable at submission tag; tests/test_invariants.py contains skip-marked stub test_submission_readiness_gates_satisfied asserting at v1.0.0 submission tag (1) grep -c open SPEC_GREENFIELD.md decision-ledger appendix section returns 0; (2) grep -c [OPEN] SPEC_SHEET.md returns 0 excluding the Status [OPEN] document-level header which transitions to [LOCKED] at Phase 0 close; (3) pytest --collect-only tests/test_invariants.py shows zero skip-marked tests; (4) make audit exits 0; (5) git tag -l v0.9.0-rc1 returns the tag name (rehearsal fired); (6) the three reviewer URLs return HTTP 200 (or 301 redirect to a 200), checkable via curl --head; SUBMISSION_AUDIT.md regenerates from the new ADR.
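Gate (4), every claim in Accepted or Superseded state, could be sketched as a scan over this ledger's own layout (a CLAIM-NNN - ADR-NNN: heading followed by a Status: line). The function name is hypothetical, and the actual internals of scripts/regenerate_audit.py are not shown in this ledger, so this is an illustration rather than that script's logic.

```python
import re

ALLOWED = {"Accepted", "Superseded"}


def disallowed_statuses(ledger_text: str) -> list[tuple[str, str]]:
    """Return (claim_id, status) for every claim whose status is not allowed."""
    offenders = []
    current = None
    for line in ledger_text.splitlines():
        heading = re.match(r"^(CLAIM-\d+) - ADR-\d+:", line)
        if heading:
            current = heading.group(1)  # remember which claim the next Status line belongs to
            continue
        status = re.match(r"^Status:\s*(\S+)", line)
        if status and current and status.group(1) not in ALLOWED:
            offenders.append((current, status.group(1)))
    return offenders
```

A make audit wrapper could exit non-zero whenever this list is non-empty, which is what gate (4) needs at the submission tag.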

CLAIM-040 - ADR-040: Phase 0 final audit findings + 7-assumption backfill (A-010 through A-016)

Status: Accepted

Source: User request 2026-05-16 (Phase 0 final audit before submission); 3-agent parallel audit findings; plan at the Phase 0 audit synthesis plan file (gitignored; emailed separately at submission)

Closing commit/ADR: cc7ed9f

Claim

At Phase 0 close (after the Phase 0-08 submission tag rehearsal cadence locks), the user requested a final meta-review of the 39-ADR interview surface for (1) unstated assumptions, (2) inconsistent methodology, and (3) source-claim faithfulness. Three Explore agents conducted parallel audits — Agent 1 methodology consistency, Agent 2 unstated assumptions, Agent 3 source faithfulness. The audit produced one substantive actionable finding (7 unstated severity-≥-medium assumptions missing from assumptions.md) plus three false-alarm findings explicitly dismissed (ADR-015 acceptance_criterion staleness, Mosbach 2021 citation year, test-stub count of 39 vs claimed 40). ADR-040 documents (a) the audit cycle as a precedent for future periodic audits, (b) the 7-assumption backfill at A-010 through A-016 with severity calibration rationale (5 high — A-010 plus A-012 plus A-013 plus A-014 plus A-016; 2 medium — A-011 plus A-015 — calibrated by load-bearing-with-vs-without-recovery-primitive ladder), (c) the false-alarm dismissal rationales preserved for audit-trail completeness. Severity calibration philosophy locked — “high” reserved for load-bearing assumptions whose failure requires methodology adjustment without an automated recovery path; “medium” reserved for load-bearing assumptions with built-in recovery primitives (cost-reconciliation post-first-run for A-011; reference-rung subset fallback for A-015). The existing A-001 through A-009 conservative “medium”-only convention is preserved (not retro-calibrated — out of scope). Backfill scope — only Phase 0-07 plus Phase 0-08 introduced external-infrastructure assumptions that the parent ADRs (ADR-020 plus ADR-030 plus ADR-032 plus ADR-033 plus ADR-039) did not register at lock time; Phase 0-00 through Phase 0-06 surfaces had complete registration via A-001 through A-009. 
Dismissal rationales (preserved for completeness) — (1) ADR-015 acceptance_criterion text says 3 trained rungs × 3 seeds × 5 LODO folds equals 45 parquet files reflecting Phase 0-01 state at lock time; post-ADR-016 plus ADR-017 reality is 4 trained rungs × 3 seeds × 4 LODO folds equals 48 files. ADR-015 immutable per CLAUDE.md immutability discipline; ADR-017 acceptance_criterion correctly carries the post-classical-floor count; SPEC_SHEET §3.2 already reflects 48-file post-ADR-017 reality. NO action — text was correct at lock time. (2) Mosbach 2021 citation in ADR-019 — arXiv:2006.04884 submitted June 2020 plus revised March 2021 plus accepted at ICLR 2021. “Mosbach 2021” matches ICLR conference publication year per standard academic-citation convention. NO action — citation defensible. (3) Test-stub count claimed mismatch (40 vs 39 ADRs) — exact ^@pytest.mark.skip grep returns 39 stubs; 39 = 32 ADR-specific plus 7 kit-level pre-Phase-0 invariants. Earlier 41 count caught docstring substring matches. NO action — counts correct.
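The third dismissal turns on the difference between a line-anchored match and a plain substring match. A small sketch of that difference, using a made-up two-stub test file (the sample source and function names are illustrative):

```python
import re


def count_skip_stubs(source: str) -> int:
    """Count decorators only: lines that start with @pytest.mark.skip."""
    return len(re.findall(r"^@pytest\.mark\.skip", source, flags=re.MULTILINE))


def count_skip_substrings(source: str) -> int:
    """Count every occurrence, including mentions inside docstrings."""
    return source.count("@pytest.mark.skip")


SAMPLE = '''import pytest

@pytest.mark.skip(reason="stub")
def test_a():
    """Remove the @pytest.mark.skip decorator at Phase 1."""

@pytest.mark.skip(reason="stub")
def test_b():
    pass
'''

if __name__ == "__main__":
    print(count_skip_stubs(SAMPLE), count_skip_substrings(SAMPLE))
```

On this sample the anchored count is 2 and the substring count is 3, because the docstring in test_a mentions the decorator by name.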

Acceptance criterion

decisions/ADR-040-.md exists with adr_id quoted as 040 status Accepted; assumptions.md contains 7 new rows A-010 through A-016 each with severity field of either high or medium per ADR-040 calibration lock (high — A-010 plus A-012 plus A-013 plus A-014 plus A-016; medium — A-011 plus A-015); each new assumption row Linked-to column references ADR-040 plus the parent ADR(s) that introduced the assumption (e.g., A-010 references ADR-030 plus ADR-033 plus ADR-039; A-013 references ADR-016 plus ADR-032 plus ADR-034); assumptions.md trailing TBD note (line 31) extended with Phase 0-07 plus Phase 0-08 closing observation noting why these 7 were back-filled at the audit (parent ADRs did not register at lock time; surfaced by Phase 0 final audit per ADR-040); SPEC_SHEET §8 Linked ADRs trailer extends from ADR-039 to include ADR-040; tests/test_invariants.py contains skip-marked stub test_phase_0_audit_findings_documented asserting (1) ADR-040 file exists at decisions/ADR-040-.md; (2) assumptions.md contains rows for each of A-010 through A-016 (regex grep for the literal ID strings at start-of-row); (3) each new row’s severity field reads high or medium per the calibration lock (5 high plus 2 medium); (4) each new row’s Linked-to column references ADR-040; SUBMISSION_AUDIT.md regenerates with 40 claims total (38 Accepted plus 2 Superseded — CLAIM-002 plus CLAIM-004). Dismissal rationales documented in this ADR body do NOT require additional invariant tests — ADR-015 staleness verified by reading ADR-017 acceptance_criterion plus SPEC_SHEET §3.2; Mosbach 2021 verified by ICLR 2021 publication-year convention; test-stub count verified by grep ^@pytest.mark.skip on tests/test_invariants.py.

CLAIM-041 - ADR-041: Phase 1 (Data) implementation bundle — manifest schema, SHA pinning, validator placement, loader arch, holdout sourcing, contamination corpus, output materialization

Status: Accepted

Source: /exploring-options Phase-1 walk (post-Phase-0 compaction); ADR-016 §Q3 + §Q4 + §Q5 + §Q6 + §Q7 deferred implementation specifics; ADR-026 §Decision tree comment “manifest validation”

Closing commit/ADR: ecfa2b6

Claim

Phase 1 entry locks seven implementation sub-decisions that operationalize ADR-016 (data design bundle) plus ADR-026 (module layout). (Q1) data/source_manifest.yaml uses the rich schema (13 fields per source: name plus hf_id plus type plus revision_sha plus license plus role plus expected_n plus cap plus selection_seed plus language_filter plus subset plus split plus citation_arxiv) at schema_version 1.0 with top-level bump_history list; inlining keeps ADR-016 Q6 ceilings + Q5 LMSYS English filter + dossier provenance grep-able next to the SHAs. (Q2) HF + GitHub SHAs are live-fetched once at Phase 1 entry via scripts/pin_source_manifest.py: huggingface_hub.HfApi.dataset_info for HF datasets; subprocess git ls-remote HEAD for GitHub-cloneable sources (xstest plus BIPIA plus InjecAgent). Re-runs are idempotent if upstream SHAs are unchanged; a mismatch triggers SHAMismatchError unless the --force flag records a bump_history entry per ADR-036 bump-trigger policy. (Q3) manifest_validation.py lives at src/data/manifest_validation.py: ADR-026 §Decision-tree comment explicitly listed “manifest validation” as a src/data/ concern; the 3 files in the §Decision tree are illustrative not exhaustive; no ADR-026 supersession needed. (Q4) src/data/loaders.py uses a single load_source(name) dispatch function with per-source normalize helpers in the same file (option D): HF datasets.load_dataset(repo, revision=sha) handles fetch + cache via HF default cache; English-only filter applied to lmsys-chat-1m before subsample (per ADR-016 Q5); column normalization produces a uniform (text, label, source) row schema.
(Q5) The 50-pair dedup calibration holdout is created via stratified-cosine-band sampling — 5 cosine bands {[0.95-1.0], [0.85-0.95], [0.75-0.85], [0.65-0.75], [0.55-0.65]} times 5 pairs per band; Brandon hand-labels each by visual inspection (ground-truth duplicate-or-not); banding ensures the FPR + FNR measurement at the locked 0.80 threshold actually probes the decision boundary; persisted to data/dedup_holdout.jsonl (gitignored — contains source content). (Q6) The contamination scan reference corpus is the slate-plus-templates blend (option B) — A-006’s “known public training-data mirrors” is interpreted operationally as (a) the 4 train-positive sources themselves (cross-source contamination check) plus (b) approximately 200 templates extracted from HackAPrompt success-pattern metadata (canonical injection-template space per Schulhoff 2023 attack-technique taxonomy); generic-web-text mirrors (The Pile plus C4 plus RedPajama) deferred to afterword as scope-expansion extension. (Q7) Post-dedup output is materialized as per-fold parquet under data/processed/fold-{0..3}/seed-{42,43,44}/{train,val,test}.parquet (48 files); index masks persisted under data/processed/index_masks/ for reverse-trace; Makefile gains 5 granular targets data-fetch + data-dedup + data-splits + data-audit + data-prepare (umbrella). Sub-decisions Q5 plus Q6 are methodology refinements to ADR-016 §Q4 calibration evidence + assumption A-006 acceptance criteria; Q1 + Q2 + Q3 + Q4 + Q7 are implementation specifics flowing from ADR-016 + ADR-026 deferred-to-Phase-1 surfaces.
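The stratified-cosine-band sampling in Q5 could be sketched as below. Several details are assumptions: candidate pairs arrive as (id_a, id_b, cosine) tuples, bands are treated as half-open [lo, hi) with only the top band closed at 1.0 (the listed bands share endpoints), and the function names and seed are illustrative. per_band is left as a parameter.

```python
import random

# Cosine bands as listed in Q5, highest first.
BANDS = [(0.95, 1.00), (0.85, 0.95), (0.75, 0.85), (0.65, 0.75), (0.55, 0.65)]


def band_index(cosine: float):
    """Return the index of the band containing `cosine`, or None if outside all bands."""
    for i, (lo, hi) in enumerate(BANDS):
        if lo <= cosine < hi or (i == 0 and cosine == hi):
            return i
    return None


def sample_holdout(pairs, per_band: int = 5, seed: int = 0) -> list:
    """Draw `per_band` pairs uniformly at random from each cosine band."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    buckets = {i: [] for i in range(len(BANDS))}
    for pair in pairs:
        i = band_index(pair[2])
        if i is not None:
            buckets[i].append(pair)
    holdout = []
    for i, bucket in buckets.items():
        if len(bucket) < per_band:
            raise ValueError(f"band {BANDS[i]} has only {len(bucket)} candidate pairs")
        holdout.extend(rng.sample(bucket, per_band))
    return holdout
```

Failing loudly on an under-filled band, rather than silently sampling fewer pairs, keeps the bands around the 0.80 threshold equally represented.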

Acceptance criterion

data/source_manifest.yaml exists at repo root with schema_version equals 1.0 plus bump_history list plus 11 sources each carrying the 13 rich-schema fields; src/data/manifest_validation.py raises ManifestSchemaError on any contract violation (missing field plus wrong type plus role-count mismatch plus slate-completeness mismatch); scripts/pin_source_manifest.py runs idempotently against unchanged remotes; data/dedup_holdout.jsonl exists with 25 plus 25 stratified-cosine-band pairs and is gitignored; evals/contamination_scan.json includes per-eval-row max-cosine to the (slate plus approximately 200 templates) reference corpus; data/processed/fold-N/seed-S/{train,val,test}.parquet are emitted per the 36-file LODO times seed grid (4 folds times 3 seeds times 3 splits); Makefile carries five granular data- targets plus data-prepare umbrella; tests/test_invariants.py unskips test_source_manifest_schema_valid (Commit 1) plus test_dedup_calibration_persisted (Commit 3) plus test_benign_contamination_scan_clean (Commit 5).
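A reviewer checking the parquet criterion could enumerate the expected files from the path pattern. The sketch below is illustrative only; expected_parquet_paths and missing_parquet_paths are hypothetical helpers, and the pattern expands to 4 folds times 3 seeds times 3 splits.

```python
from itertools import product
from pathlib import PurePosixPath

FOLDS = range(4)                    # fold-0 .. fold-3
SEEDS = (42, 43, 44)                # seed slate from the path pattern
SPLITS = ("train", "val", "test")


def expected_parquet_paths(root="data/processed"):
    """List every parquet path implied by the fold/seed/split pattern."""
    return [
        PurePosixPath(root) / f"fold-{fold}" / f"seed-{seed}" / f"{split}.parquet"
        for fold, seed, split in product(FOLDS, SEEDS, SPLITS)
    ]


def missing_parquet_paths(existing, root="data/processed"):
    """Return expected paths that are absent from `existing` (a set of path strings)."""
    return [p for p in expected_parquet_paths(root) if str(p) not in existing]
```

Passing the set of paths actually found on disk to missing_parquet_paths gives a quick completeness check without reading any parquet content.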

CLAIM-042 - ADR-042: LLM-judge pre-label as Q5 holdout-labeling bootstrap with human override

Status: Accepted

Source: User request to use OpenAI judge bootstrap on Phase 1 Commit 3 holdout (post-build_dedup_holdout.py run)

Closing commit/ADR: df72b01

Claim

ADR-041 Q5 locks Brandon-hand-labeled 50-pair stratified-cosine-band holdout as the dedup calibration ground truth (option A; option D LLM-judge labels was explicitly rejected because judge prior may contaminate calibration). ADR-042 refines (does not supersede) Q5 with a labeling-workflow bootstrap — gpt-4o-2024-08-06 produces a preliminary llm_judge_label per pair plus llm_judge_reasoning rationale plus llm_judge_model snapshot identifier; Brandon hand-examines each pair and either confirms (sets human_label equals llm_judge_label) or overrides (sets human_label to the corrected truth). The calibration script resolves the effective true_duplicate via priority order — human_label takes precedence when non-null else falls back to llm_judge_label. The calibration JSON discloses label_provenance with human_verified_count plus llm_judge_only_count plus human_verified_pct plus llm_judge_model so reviewers can read off exactly which fraction of labels carry human verification. The ground-truth methodology stays Brandon-hand-labeled-when-possible — the LLM bootstrap only fills in initial values that humans review; this preserves ADR-041 Q5’s rejection of pure-LLM-judge labeling while accelerating the labeling workflow. The choice of gpt-4o-2024-08-06 matches the snapshot pinned by ADR-018 for the headline LLM-as-rater (consistency across the LLM-judge usage surface).
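The priority-order resolution and the label_provenance block described above might look like the following minimal sketch. The field names come from the claim text; the function names and the 0-100 scale of human_verified_pct are assumptions, not the contents of scripts/calibrate_dedup.py.

```python
def resolve_true_duplicate(record):
    """Pick the effective label: human_label when non-null, else llm_judge_label.

    The `is not None` test matters: a human_label of False is a real
    override and must not fall through to the LLM label.
    """
    human = record.get("human_label")
    if human is not None:
        return human
    return record.get("llm_judge_label")


def label_provenance(records, llm_judge_model):
    """Summarize how many labels carry human verification.

    Assumption: human_verified_pct is reported on a 0-100 scale.
    """
    total = len(records)
    human_verified = sum(1 for r in records if r.get("human_label") is not None)
    llm_only = sum(
        1
        for r in records
        if r.get("human_label") is None and r.get("llm_judge_label") is not None
    )
    pct = 100.0 * human_verified / total if total else 0.0
    return {
        "human_verified_count": human_verified,
        "llm_judge_only_count": llm_only,
        "human_verified_pct": pct,
        "llm_judge_model": llm_judge_model,
    }
```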

Acceptance criterion

scripts/llm_prelabel_dedup_holdout.py exists at repo root and accepts OPENAI_API_KEY via environment plus writes llm_judge_label plus llm_judge_reasoning plus llm_judge_model fields per pair to data/dedup_holdout.jsonl; scripts/calibrate_dedup.py resolves true_duplicate via priority order (human_label takes precedence over llm_judge_label) plus persists label_provenance disclosure block in evals/dedup_calibration.json; .env.local is gitignored and accepted by the script via os.environ.get("OPENAI_API_KEY"); openai>=1.50 added to pyproject.toml dependencies.

CLAIM-043 - ADR-043: Post-split cross-source leakage cleanup — drop train+val rows that exact-match or cosine-near-match test

Status: Accepted

Source: Phase 1 Commit 5 pipeline run (background task bwgkvoy7z + earlier ble3xg5b2) — leakage_report.json empirical finding

Closing commit/ADR: e4454c0

Claim

Phase 1 Commit 5 pipeline surfaced an empirical leakage finding — 6 exact-hash plus 165 cosine greater-than-or-equal-0.85 train-plus-val versus test overlaps across 12 (fold, seed) splits despite ADR-016 Q4 within-source dedup running on every source. Root cause — ADR-016 Q5 specified cross-source dedup ONLY for benigns; positives weren’t cross-source-deduped because LODO treats each positive source as a unit. But cross-source positive near-paraphrases DO leak across LODO folds when a near-paraphrase of a held-out-source row exists in another source’s train data; this gives the model a “seen it before” advantage on the held-out test. ADR-043 closes this methodology gap by adding a post-split leakage-cleanup pass — for each (fold, seed) split, after make_splits runs but before materialize_splits, src/data/dedup.py::drop_train_test_leakage scans the train+val pool against the held-out test pool; rows in train+val that exact-match or cosine-greater-than-or-equal-0.85-match any test row are dropped (test stays intact). The threshold 0.85 matches the leakage scan threshold in src/data/audit.py::compute_leakage_report (per ADR-016 Q3 hard-locked invariant). The cleanup is implemented as a new function in src/data/dedup.py plus wired through src/data/splits.py::apply_leakage_cleanup which re-partitions the cleaned train+val into the same 80/20 ratio. Pipeline orchestrator scripts/run_data_pipeline.py invokes this between make_splits and materialize_splits. Pre-cleanup pipeline run (without ADR-043) recorded 6 exact plus 165 cosine overlaps; post-cleanup pipeline run records zero overlaps (leakage_clean equals True) honoring ADR-016 Q3 hard-locked leakage invariant. Drop cost — approximately 0.08 percent exact plus 0.17 percent cosine of the train+val pool per split (approximately 171 rows total dropped across 12 splits from a 4707-positive base; test pool unchanged). 
ADR-043 supersedes the implicit “no cross-source positive dedup” stance in ADR-016 Q5 for the leakage-cleanup step specifically — the rest of ADR-016 Q5 (benign cross-source LMSYS-priority dedup) is preserved unchanged.
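The cleanup pass can be pictured with a brute-force sketch. It assumes each row is a dict with a text string and a precomputed embedding list, and that exact matching hashes the raw UTF-8 text; both are assumptions for illustration, as is the name drop_train_test_leakage_sketch. It is not the code in src/data/dedup.py.

```python
import hashlib
import math


def _cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def _text_hash(text):
    # Assumption: exact match means identical raw text, with no normalization.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def drop_train_test_leakage_sketch(pool, test, threshold=0.85):
    """Drop train+val rows that exact-match or cosine-match any test row.

    Returns (kept_rows, counts). The test pool is never modified.
    """
    test_hashes = {_text_hash(row["text"]) for row in test}
    kept = []
    counts = {"exact": 0, "cosine": 0}
    for row in pool:
        if _text_hash(row["text"]) in test_hashes:
            counts["exact"] += 1
            continue
        if any(_cosine(row["embedding"], t["embedding"]) >= threshold for t in test):
            counts["cosine"] += 1
            continue
        kept.append(row)
    return kept, counts
```

The nested loop costs one comparison per (pool row, test row) pair, so it only suits small examples; a real pipeline would use batched matrix products or an index.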

Acceptance criterion

src/data/dedup.py::drop_train_test_leakage exists and drops train+val rows with cosine greater-than-or-equal threshold (default 0.85) to any test row; src/data/splits.py::apply_leakage_cleanup applies the function per (fold, seed) split and re-partitions cleaned train+val at 80/20; scripts/run_data_pipeline.py invokes apply_leakage_cleanup between make_splits and materialize_splits; post-cleanup re-run of pipeline records leakage_clean equals True in evals/leakage_report.json (zero exact plus zero cosine overlaps); ADR-043 cleanup_records counts persist in pipeline log for audit; afterword note added to WRITEUP/limitations-and-future-work.md explaining the methodology gap that ADR-043 closes (cross-source positive near-paraphrase leakage was not anticipated by ADR-016 Q5).

CLAIM-044 - ADR-044: Phase 2 training implementation bundle — seed slate reconciliation + manifest move + classical-floor location + YAML config schema + trainer split + per-rung orchestration + fixture preflight

Status: Accepted

Source: Phase 2 walkthrough — /exploring-options Phase-2 seven-question ratify session 2026-05-16 following Phase 1’s precedent (ADR-041 implementation bundle)

Closing commit/ADR: 8c053b0

Claim

Phase 2 entry bundles seven implementation choices closing implementation gaps in ADR-015/017/019/020/026/027 and resolving three pre-lock inconsistencies surfaced at Phase 2 entry. Q1 — seed slate is set to (42, 43, 44) matching Phase 1’s src/data/splits.py SEEDS constant (per ADR-041 materialization at data/processed/fold-N/seed-N/(train, val, test).parquet); body-text partial supersession of ADR-019 line 99 where (42, 1337, 2025) was an arbitrary 3-seed slate at the ADR-006 floor and re-materializing 12 splits would invalidate evals/leakage_report.json + evals/data_audit.json + evals/contamination_scan.json plus 36 parquets plus 36 index masks without methodology gain. The rest of ADR-019 (hyperparameter recipe — r=8, alpha=16, dropout=0.1, target_modules, lr=1e-4, warmup 10 percent, cosine schedule, 2 epochs, bf16, max_len=8192, WeightedTrainer, fp32 softmax cast) is preserved unchanged. Q2 — source manifest moved from data/source_manifest.yaml to configs/data/source_manifest.yaml honoring ADR-026’s locked 5-subpackage layout which reserves data/ as non-committed HF cache plus processed parquets. Q3 — classical floor (TF-IDF + LR per ADR-017) lives at src/training/tfidf_lr.py honoring ADR-026 layout (ADR-017 line 92 mentioned src/rungs/tfidf_lr.py speculatively; ADR-026 is the later authoritative layout lock with no src/rungs/ subpackage). Q4 — per-rung YAML schema is primary source of truth; YAML lives at configs/rungs/.yaml; trainer code reads YAML and instantiates the recipe; ADR-019 is cited at YAML head as the lock with a do-not-edit-without-superseding-ADR comment; honors SPEC-§5 config-hash invariant per ADR-026 line 33 which only works if YAML is canonical. 
Q5 — trainer split by stack — src/training/train_modernbert.py handles the 3 transformer rungs via HF Trainer plus WeightedTrainer per ADR-019; src/training/train_classical.py handles the sklearn TF-IDF plus LR floor per ADR-017; both write to the uniform predictions parquet schema; satisfies ADR-026 line 73 multi-rung-trainer language by reading train_modernbert.py as the multi-rung trainer for the transformer slate. Q6 — orchestration granularity is per-rung — scripts/train_rung.py with --rung in (frozen_probe, lora, full_ft) sweeps 12 cells per invocation (3 seeds times 4 LODO folds); scripts/train_classical_floor.py runs the classical rung locally on CPU; 4 jobs total (1 local CPU plus 3 GPU); enables per-rung cost-cap budgets and per-rung resumability if a transformer rung fails mid-sweep. Q7 — Phase 2 extension of make smoke per ADR-027 line 75 wires configs/profiles/fixtures.yaml plus tests/fixtures/(parquet files) plus a tiny-data trainer path (approximately 50 examples times 4 sources times 1 fold times 1 seed times 1 epoch) for a sub-5-minute laptop CPU pipeline pass closing ADR-027’s deferred fixture-pipeline wiring. Implementation cadence — 6 commits following Phase 1 precedent — Commit 1 (this commit) does manifest move + ADR-044 + paths; Commit 2 lands training primitives per ADR-019/020 (load_modernbert + lora_config + weighted_trainer + batch_table + training_args + softmax_cast); Commit 3 lands classical floor; Commit 4 lands ModernBERT trainer + 3 rung YAMLs; Commit 5 lands per-rung RunPod configs + train_rung.py orchestrator + cost_rollup.py + library_imports.md updates; Commit 6 lands fixtures + smoke pipeline + Makefile Phase 2 targets + ROADMAP Phase 2 close note.
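The Q6 per-rung sweep can be sketched as a small command-line skeleton. The rung names, seed slate, and fold count come from the claim; build_parser, sweep_cells, and the fold-major iteration order are hypothetical and are not the actual scripts/train_rung.py.

```python
import argparse
from itertools import product

RUNGS = ("frozen_probe", "lora", "full_ft")   # transformer rungs named in Q6
SEEDS = (42, 43, 44)                          # seed slate from Q1
FOLDS = (0, 1, 2, 3)                          # 4 LODO folds


def build_parser():
    """Parser with a single required --rung option restricted to the three rungs."""
    parser = argparse.ArgumentParser(description="Sweep one rung over all cells.")
    parser.add_argument("--rung", choices=RUNGS, required=True)
    return parser


def sweep_cells(rung):
    """Return the (rung, fold, seed) cells for one invocation.

    Assumption: fold-major ordering; the ledger does not specify an order.
    """
    return [(rung, fold, seed) for fold, seed in product(FOLDS, SEEDS)]


if __name__ == "__main__":
    args = build_parser().parse_args()
    for cell in sweep_cells(args.rung):
        print(cell)
```

Keeping the cell list as plain tuples makes resumability simple in principle: a failed sweep can skip any cell whose output already exists.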

Acceptance criterion

configs/data/source_manifest.yaml exists at the new location and data/source_manifest.yaml no longer exists (git mv operation); all path references in src/data/loaders.py + src/data/manifest_validation.py + src/data/templates.py + scripts/pin_source_manifest.py + tests/test_invariants.py + Makefile + SPEC_SHEET.md + SPEC_GREENFIELD.md + assumptions.md + .gitignore are updated to the new path; test_source_manifest_schema_valid invariant still passes against the new path; ADR-019 frontmatter status remains Accepted (partial body-text supersession of seed slate only, recipe unchanged); 4-rung trained slate enumeration (classical_floor, frozen_probe, lora, full_ft) is implemented in subsequent commits with all rungs landing under src/training/; per-rung YAML configs land at configs/rungs/.yaml with ADR-019 cited at YAML head; per-rung orchestration via scripts/train_rung.py per ADR-044 Q6; Phase 2 make smoke extension lands per ADR-044 Q7 + ADR-027 line 75 wiring; 6-commit cadence closes with docs/ROADMAP.md Phase 2 close note + SUBMISSION_AUDIT regen + transcript checkpoint.

CLAIM-045 - ADR-045: Phase 3 evaluation implementation bundle — scoring-first contract + 6-commit cadence + tiered reference scorers + classical-scaffold + full-pairwise persistence with headline-only WRITEUP + pydantic schema validation

Status: Accepted

Source: Phase 3 walkthrough — /exploring-options “start phase 3” seven-question ratify session 2026-05-16 following Phase 1 (ADR-041) plus Phase 2 (ADR-044) precedent

Closing commit/ADR: c406f58

Claim

Phase 3 entry bundles seven implementation choices closing implementation gaps in ADR-018/021/022/023/024/025/026/034 left open after Phase 0 lock and Phase 2 close. Q1 — pre-Phase-3 housekeeping is already satisfied (assumptions A-010 through A-016 backfilled in assumptions.md via the ADR-040 cycle 2026-05-16); no warm-up commit required. Q2 — Phase 3 ships in a 6-commit cadence mirroring Phase 2’s proven pattern (ADR-044 precedent) — Commit 1 (this commit) does ADR-045 plus SPEC_SHEET §3.7 Phase 3 status table; Commit 2 lands src/scoring/{protectai, lakera_api, llm_judge}.py per ADR-018 (4 reference rungs at unified predictions parquet schema); Commit 3 lands src/eval/calibration_battery.py per ADR-023 (4-ECE matrix plus Brier plus reliability plus temperature plus isotonic interventions; validation-only fit per ADR-011 Guarantee 6); Commit 4 lands src/eval/{operating_points, slice_analysis}.py per ADR-025 plus ADR-021 (dual-policy thresholds with verification-reachability audit per A-009 plus 5-slice OOD aggregation with pooled-headline plus per-slice spoke); Commit 5 lands scripts/{fit_dual_policy_thresholds, run_metrics_battery, run_bootstrap_battery, eval_from_hub}.py per ADR-022/024/034 (full-pairwise paired-bootstrap persistence per Q6 below plus joblib orchestrator-layer parallelization on 64-core Threadripper plus T0-tier eval-from-hub reproducibility); Commit 6 closes with Makefile Phase 3 targets plus fixture-extension smoke pipeline plus ROADMAP Phase 3 close note plus Phase 4 unblock. Q3 — scoring layer ships first (Commit 2) before metric layer (Commits 3-4) so the uniform per-row predictions parquet schema serves as the canonical contract for all downstream consumers; the metric layer is scorer-agnostic via the contract — mirrors Phase 1’s manifest-as-canonical pattern per ADR-041. 
Q4 — reference scorer execution is tiered — Tier A (ProtectAI v1 plus v2; free local HF inference per ADR-018; runs in CI smoke) lands in Commit 2 unconditionally; Tier B (gpt-4o-2024-08-06 plus claude-sonnet-4-6 LLM judges plus paid Lakera fallback if needed; paid APIs) ships as scripts/run_reference_scorers.py --tier paid with interactive approval prompt plus --dry-run cost preview mirroring Phase 2’s make headline-* pattern from ADR-020; LLM judge cache infrastructure at evals/audit/llm_judge_cache/<judge>__<row_hash>.json per A-007 plus A-014 (cache survives mid-Phase deprecation). Q5 — transformer-output dependence is handled via scaffold-with-classical-floor smoke discipline — all src/eval/ modules consume any predictions parquet matching the schema contract; smoke tests use the 12 classical-floor parquets plus the tiny fixture parquets at tests/fixtures/processed/; transformer-pred-consuming invariants (e.g., test_per_epoch_predictions_present per ADR-044) remain @pytest.mark.skip until the 72 transformer parquets exist (canonical GPU runs operator-gated per ADR-020); precedent matches ADR-027 fixture-first smoke discipline. Q6 — bootstrap battery scope persists full pairwise (6 rung-vs-rung comparisons across 4 rungs (classical_floor, frozen_probe, lora, full_ft) times 5 OOD slices times pooled levels — approximately 30 paired-bootstrap cells) but the WRITEUP narrative features only the 3 headline comparisons (classical-floor vs frozen-probe — does pretraining help; frozen-probe vs LoRA — does adaptation help; LoRA vs full-FT — is parameter efficiency worth it); persistence is the methodology contract per ADR-013 (post-hoc questions answered from disk without re-running the bootstrap); multi-comparison correction acknowledgment per ADR-022 covers the WRITEUP-featured set only. 
Q7 — schema validation uses pydantic v2 throughout src/eval/ and src/scoring/; PredictionsRowModel, MetricsRecordModel, SliceMetricsModel, OperatingPointModel, CalibrationRecordModel, and ReachabilityAuditModel are BaseModel classes validated on read/write; consistent with Phase 1’s configs/data/source_manifest.yaml validation via pydantic per ADR-041. Implementation cadence follows Phase 1 plus Phase 2 precedent — each commit ships a green-CI surface; ADR-045 cited in subsequent commits as Q-N for specific decisions.
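The Q6 cell count follows from simple combinatorics, shown in the sketch below. The rung names and headline comparisons come from the claim, and the slice names are the five OOD slices this ledger attributes to ADR-021; bootstrap_cells and is_headline are hypothetical helpers.

```python
from itertools import combinations, product

RUNGS = ("classical_floor", "frozen_probe", "lora", "full_ft")
SLICES = ("NotInject", "XSTest", "JBB", "BIPIA", "InjecAgent")

# The three comparisons the WRITEUP narrative features.
HEADLINE = {
    ("classical_floor", "frozen_probe"),
    ("frozen_probe", "lora"),
    ("lora", "full_ft"),
}


def bootstrap_cells():
    """Every (rung_a, rung_b, slice) cell: 6 unordered rung pairs times 5 slices."""
    pairs = list(combinations(RUNGS, 2))
    return [(a, b, s) for (a, b), s in product(pairs, SLICES)]


def is_headline(rung_a, rung_b):
    """True when the pair is one of the three featured comparisons."""
    return (rung_a, rung_b) in HEADLINE
```

Persisting all cells while featuring only the headline subset is what lets post-hoc questions be answered from disk without re-running the bootstrap.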

Acceptance criterion

decisions/ADR-045-phase-3-evaluation-implementation-bundle.md exists at this path with Accepted status; SPEC_SHEET.md §3.7 Phase 3 implementation status table added mirroring §3.6 Phase 2 pattern with per-commit rows tracking green status; SUBMISSION_AUDIT.md regenerates via scripts/regenerate_audit.py with ADR-045 included; uniform predictions parquet schema (rung, fold, seed, row_idx_in_source, source, text, label, predicted_proba_class1, contamination_state) implemented as pydantic PredictionsRowModel in src/eval/schemas.py landing in Commit 2; src/scoring/{protectai, lakera_api, llm_judge}.py implementing the 4 reference rungs per ADR-018 land in Commit 2 with Tier A (ProtectAI) in CI smoke and Tier B (LLM judges plus Lakera) gated on paid-API approval; src/eval/calibration_battery.py wiring the eval-toolkit ECE 4-variant matrix plus Brier plus reliability plus temperature plus isotonic interventions per ADR-023 lands in Commit 3; src/eval/{operating_points, slice_analysis}.py implementing ADR-025 dual-policy thresholds plus ADR-021 5-slice OOD aggregation land in Commit 4; scripts/{fit_dual_policy_thresholds, run_metrics_battery, run_bootstrap_battery, eval_from_hub}.py implementing per-rung orchestration plus full-pairwise paired-bootstrap persistence plus T0-tier eval-from-hub land in Commit 5; LLM judge cache infrastructure at evals/audit/llm_judge_cache/__.json operational per A-007 plus A-014; Makefile Phase 3 targets (eval-classical-floor, eval-reference-scorers-free, eval-reference-scorers-paid, calibration-battery, dual-policy-thresholds, bootstrap-battery, eval-from-hub, metrics-battery) plus tests/smoke/test_smoke_pipeline.py extension covering end-to-end calibration plus threshold-fit pass on classical-floor fixture predictions land in Commit 6; docs/ROADMAP.md Phase 3 close note added with deliverables plus operator follow-ups plus Phase 4 unblock confirmation; transcript checkpoint at transcripts/2026-05-16__phase-3-implementation.md captured via /save-transcript; bootstrap battery cell count equals approximately 30 (6 pairwise times 5 slices) persisted to evals/bootstrap/ with WRITEUP featuring 3 headline comparisons (classical-floor vs frozen-probe plus frozen-probe vs LoRA plus LoRA vs full-FT); contamination_state column carried through every predictions parquet plus every metrics parquet per ADR-005 plus ADR-018 four-tier taxonomy; verification reachability audit JSON at evals/audit/verification_reachability.json schema validated via pydantic ReachabilityAuditModel per ADR-025 plus A-009.

CLAIM-046 - ADR-046: Phase 4 analysis implementation bundle — 6-commit cadence + scaffold-with-classical + always-emit-both-CIs auto-flag + MDE-on-every-emitted-CI + LLM-rater audit included (user-overridden from defer) + library-first hybrid figures + Phase 5 prep deferred

Status: Accepted

Source: Phase 4 walkthrough — /exploring-options 7-question Phase 4 ratify session 2026-05-16 following Phase 1 (ADR-041) plus Phase 2 (ADR-044) plus Phase 3 (ADR-045) precedent; user override on Q5 (defer recommendation rejected; include-LLM-rater-audit-now locked); user reaffirmation at Q6 reframed library-first as project-wide invariant requiring retroactive Phase 1 audit per ADR-047

Closing commit/ADR: 70e34fd

Claim

Phase 4 entry bundles seven implementation choices closing implementation gaps in ADR-006/018/021/022/023/024/025/031/033/038 left open after Phase 0 lock and Phase 3 close. Q1 — Phase 4 ships in a 6-commit cadence mirroring Phase 2 + Phase 3 proven pattern — Commit 1 (this commit) does ADR-046 plus SPEC_SHEET §3.8; Commit 2 lands marginal bootstrap_ci primitives plus cv_clt_ci primitives plus mde_from_ci primitives wired through src/eval/; Commit 3 lands src/eval/cross_fold_ci.py with always-emit block-bootstrap-on-folds spoke per A-008 plus auto-flag column when block_bootstrap_CI_halfwidth divided-by cv_clt_CI_halfwidth exceeds 1.5; Commit 4 lands src/eval/figures.py with library-first hybrid renderers consuming eval_toolkit.plotting.{plot_pr_curve, plot_reliability_diagram, plot_bootstrap_distribution, plot_metric_bars, plot_lift_ci, save_figure, PALETTE, set_plot_style} for the 4 in-scope primitives plus project glue for the 5 gap-filling figures (F1 Pareto, F2 ROC, F5 slice heatmap, F6 LODO breakdown, F7 dual-policy grid layout) citing upstream issues

Acceptance criterion

decisions/ADR-046-phase-4-analysis-implementation-bundle.md exists at this path with Accepted status; SPEC_SHEET.md §3.8 Phase 4 implementation status table added mirroring §3.7 Phase 3 pattern with per-commit rows tracking green status; SUBMISSION_AUDIT.md regenerates via scripts/regenerate_audit.py with ADR-046 included; src/eval/cross_fold_ci.py implementing always-emit-both cv_clt_ci plus block-bootstrap-on-folds plus a_008_flag_fired column per ADR-024 plus A-008 lands in Commit 3; src/eval/figures.py implementing the 7-figure slate as library-first hybrid (F3 plus F4 plus F7 sub-panels via eval-toolkit primitives; F1 plus F2 plus F5 plus F6 plus F7 grid via project glue with TODOs citing upstream issues

CLAIM-047 - ADR-047: Phase 1 library-first carryforward refactor — audit findings + remediation plan (splits + dedup + leakage_report + contamination_scan migration to eval-toolkit primitives; orphaned code removed in-commit)

Status: Accepted

Source: Phase 4 entry walkthrough — /exploring-options 7-question Phase 4 ratify session 2026-05-16; user reaffirmation at Q6 reframed library-first as project-wide invariant requiring retroactive audit; audit pass executed in-session against Phases 1 plus 2 plus 3 (shipped) plus Phase 5 (planned)

Closing commit/ADR: ab8a501

Claim

Phase 4 entry walkthrough surfaced a project-wide library-first invariant reaffirmed by the user at Q6 (figures slate) — “we are not handrolling tools that can be better handled by my eval-toolkit and separately tested there with golden eval data sets” — applied retroactively to Phases 1 plus 2 plus 3 (shipped) and prospectively to Phase 5 (planned). The audit pass against src/data/ plus src/training/ plus src/scoring/ plus src/eval/ identified four confirmed Phase 1 hand-rolls — src/data/splits.py::make_splits reimplements the source-disjoint k-fold partition that eval-toolkit’s SourceDisjointKFoldSplitter was literally abstracted from a predecessor of this project to provide (per its own docstring); src/data/dedup.py::{dedup_within_source, drop_train_test_leakage, dedup_cross_source_benigns} reimplements the greedy-near-dedup + cross-corpus dedup machinery that eval-toolkit’s near_dedup plus cross_dedup plus EmbeddingCosineStrategy already provide turnkey; src/data/audit.py::compute_leakage_report reimplements the train↔︎eval leakage detection that eval-toolkit’s run_leakage_checks([ExactDuplicateCheck(), NearDuplicateCheck(...), CrossSplitLeakageCheck()]) already orchestrates; src/data/audit.py::compute_contamination_scan reimplements per-row max-cosine-to-reference-corpus machinery that eval-toolkit’s SimilarityStrategy.pairs_across(query, reference, k=1) already provides via the strategy Protocol. Phase 2 (src/training/) and Phase 3 (src/scoring/ + src/eval/{calibration_battery, operating_points, slice_analysis}.py) are confirmed clean — Phase 2 uses HF Transformers plus LoRA plus sklearn (eval-toolkit doesn’t compete), Phase 3 scoring is project-specific LLM-judge plus ProtectAI wrappers, and Phase 3 eval is already library-first per decisions/library_imports.md ledger lines 39 through 53. Phase 5 (Quarto site plus model card generation plus WRITEUP authoring plus HF Hub upload) is confirmed clean in pre-audit. 
Remediation plan — four refactor commits, each consuming the upstream primitive plus deleting now-unreachable local helpers in-commit per the no-orphaned-code-during-refactor discipline (saved as memory 2026-05-16); local project-owned embedder glue (src/data/dedup.py::{get_encoder, compute_embeddings, encoder_revision_sha} — sentence-transformer all-MiniLM-L6-v2 loader plus SHA-pinning plus batched encoding) preserved as the embedder= callable passed to EmbeddingCosineStrategy(embedder=compute_embeddings); locked constants preserved bit-for-bit (within-source dedup THRESHOLD=0.80 per ADR-016 Q4; cross-source dedup THRESHOLD=0.80 + LMSYS-priority cross-source dedup per ADR-016 Q4-A; train↔︎test leakage cleanup threshold=0.85 per ADR-043; benign contamination threshold=0.85 per ADR-041 Q6; CONTAMINATION_THRESHOLD plus BENIGN_CONTAMINATION_THRESHOLD_PCT trigger gates per ADR-041 Q6 + ADR-016 A-005); evals/leakage_report.json schema migration required (project-dict to LeakageReport-derived JSON or adapter wrapper) — preserved-callers updated in the same commit. Two upstream contributions filed — issue

Acceptance criterion

decisions/ADR-047-phase-1-library-first-carryforward-refactor.md exists at this path with Accepted status; SPEC_SHEET.md §3.5 (Phase 1 status table) gains an audit-findings row tracking the 4-refactor sequence (Commits 1-4) with green status as each lands; SUBMISSION_AUDIT.md regenerates via scripts/regenerate_audit.py with ADR-047 included; decisions/upstream_issues.md ledger reflects issues

CLAIM-048 - ADR-048: LLM-rater reference-scorer audit protocol — disagreement-with-LoRA baseline + stratified (5 OOD slices × 3 contamination_states) sampling + fixed 2-axis rubric + inter-rater gpt-4o-plus-claude-sonnet + dry-run-then-single-approval cost UX

Status: Accepted

Source: Phase 4 Commits 2-6 tactical walkthrough — /exploring-options “Phase 4” Q4 5-question session 2026-05-16 (same conversation as ADR-046 plus ADR-047); user requested in-depth explanation of LLM-rater audit design before ratifying; locked Option A1 (disagreement-with-LoRA + stratified + fixed 2-axis rubric); per CLAUDE.md anti-pattern “Adding a methodology component without an ADR” the methodology decision (sampling protocol plus rubric design) warrants its own ADR

Closing commit/ADR: 0fad4e1

Claim

ADR-046 Q5 user-overrode the original defer recommendation to include the reference-scorer LLM-rater audit as a real Phase 4 deliverable (citing the value of front-loading the audit rather than waiting for a regex-tagger-conservative-enough trigger that may never fire). ADR-048 extends ADR-046 Q5 with the methodology details that turn that decision into an executable protocol — scripts/audit_reference_scorers.py (lands in Phase 4 Commit 5 per ADR-046 Q1 6-commit cadence) samples approximately 50 prediction-pairs per reference rung (R-LLM-OpenAI plus R-LLM-Anthropic plus R-ProtectAI-v1 plus R-ProtectAI-v2 per ADR-018) where the reference scorer’s prediction disagrees with the LoRA trained-rung classifier (sign(reference_proba > 0.5) != sign(lora_proba > 0.5)); LoRA is the disagreement baseline (rather than full-FT or classical-floor) because LoRA is the cheapest GPU-trained rung per ADR-019 ($60/pod cap vs full-FT’s $100/pod cap per ADR-020) plus the intermediate-quality matched-budget comparison point per ADR-018, balancing audit-narrative realism (we trained a real model, not just a probe) with operator-cost feasibility (the audit can fire as soon as make headline-lora lands, ahead of full-FT). Sampling is stratified across the cross-product of 5 OOD slices per ADR-021 (NotInject plus XSTest plus JBB plus BIPIA plus InjecAgent) and 3 contamination_states per ADR-005 (clean plus suspected_contamination plus vendor_black_box) — approximately 3 pairs per (slice, contamination_state) stratum where feasible, totaling approximately 45-50 pairs per reference rung; stratification is necessary because uniform random sampling would heavily over-sample the largest slices (BIPIA plus InjecAgent dwarf the smaller slices) and miss failure modes in rare strata. 
Each sampled pair is rated by BOTH gpt-4o-2024-08-06 and claude-sonnet-4-6 (same snapshot pins as ADR-018 LLM-judge slate per ADR-042 dedup-holdout precedent for inter-rater reliability) producing two independent rubric records per pair; per-pair rubric is fixed 2-axis (boolean rater_judgment_correct plus ordinal calibration_assessment with three levels overconfident-well_calibrated-underconfident) plus optional string rater_notes for qualitative observations without burdening the aggregation axes; inter-rater agreement (Cohen’s kappa on rater_judgment_correct plus weighted-kappa on the ordinal calibration_assessment) reported in the audit JSON for methodology transparency. Aggregated statistics that the WRITEUP can cite directly — per-reference-rung percent-judged-correct on disagreement cases (the headline reference-scorer-trustworthiness number); per-reference-rung calibration distribution (overconfident-vs-well-vs-underconfident percentages); per-(slice, contamination_state) breakdown surfaces failure modes; inter-rater agreement number scopes confidence in the audit itself. Cost envelope — 4 reference rungs times approximately 50 pairs times 2 LLM-rater calls per pair equals approximately 400 LLM calls at approximately $0.005 per call equals approximately $2 plus prompt-template overhead approximately $3 totaling approximately $5 per A-002 envelope (well under $1 per reference rung); cost UX follows Phase 3 ADR-045 Q4 plus ADR-020 pattern — python scripts/audit_reference_scorers.py --dry-run previews exact pair count plus per-rung cost estimate; non-dry-run requires single interactive approval prompt before any LLM calls fire; per-rung approvals are over-gating for a sub-$5 audit. 
Output persistence — evals/audit/reference_scorer_rater_audit.json (one top-level key per reference rung plus nested per-rater plus inter-rater agreement) validated against ReferenceScorerRaterAuditModel per ADR-046 Q2 schema-placement-in-src/eval/schemas.py decision; LLM judge cache infrastructure at evals/audit/llm_judge_cache/<judge>__<sha256-prefix>.json per A-007 plus A-014 reused so cache survives cross-run plus mid-Phase-deprecation. Phase 4 WRITEUP spoke WRITEUP/reference-scorer-audit.md (drafted in Phase 5 per ADR-046 Q7 phase-tailoring) consumes the audit JSON to populate the per-rung trustworthiness headline plus calibration-distribution figure plus methodology narrative. Operator dependency — audit fires after make headline-lora lands (operator-gated GPU run per ADR-020); does NOT require full-FT or frozen-probe runs; can fire ahead of full Phase 4 canonical numbers if operator prioritizes the audit-narrative deliverable.
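The disagreement filter and stratified draw described in this claim could be sketched as follows. The row keys reference_proba, lora_proba, slice, and contamination_state mirror names used in the claim; the dict-based row shape, the function names, and the seed are assumptions for illustration rather than the audit script itself.

```python
import random
from collections import defaultdict


def disagrees(reference_proba, lora_proba, cutoff=0.5):
    """True when the two scorers land on opposite sides of the cutoff."""
    return (reference_proba > cutoff) != (lora_proba > cutoff)


def sample_disagreements(rows, per_stratum=3, seed=0):
    """Stratified sample of disagreement rows.

    Rows are grouped by (slice, contamination_state); up to `per_stratum`
    rows are drawn from each group, so small strata are not crowded out
    by the largest slices.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        if disagrees(row["reference_proba"], row["lora_proba"]):
            strata[(row["slice"], row["contamination_state"])].append(row)
    sample = []
    for key in sorted(strata):  # sorted keys keep the draw reproducible
        bucket = strata[key]
        sample.extend(rng.sample(bucket, min(per_stratum, len(bucket))))
    return sample
```

With 5 slices and 3 contamination states fully populated, 3 rows per stratum yields 45 pairs, matching the lower end of the 45-50 range quoted above.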

Acceptance criterion

decisions/ADR-048-llm-rater-reference-scorer-audit-protocol.md exists at this path with Accepted status; SUBMISSION_AUDIT.md regenerates via scripts/regenerate_audit.py with ADR-048 included; src/eval/schemas.py extended with ReferenceScorerRaterAuditModel pydantic v2 BaseModel at Phase 4 Commit 2 landing per ADR-046 Q2 + this ADR’s persistence layout; scripts/audit_reference_scorers.py at Phase 4 Commit 5 landing implements (a) --dry-run cost preview surfacing exact pair count plus per-rung cost estimate before any LLM call fires, (b) sampling protocol per this ADR — disagreement-with-LoRA via sign(reference_proba > 0.5) != sign(lora_proba > 0.5) filter applied to predictions parquets at evals/predictions/ for each reference rung in (R-LLM-OpenAI, R-LLM-Anthropic, R-ProtectAI-v1, R-ProtectAI-v2) per ADR-018, stratified by (slice, contamination_state) cross-product per ADR-021 + ADR-005, targeting approximately 3 pairs per stratum totaling approximately 50 per reference rung, (c) inter-rater protocol — each sampled pair rated by BOTH gpt-4o-2024-08-06 and claude-sonnet-4-6 via the fixed 2-axis rubric (boolean rater_judgment_correct plus ordinal calibration_assessment plus optional string rater_notes), (d) single interactive approval prompt before non-dry-run firing per ADR-020 + ADR-045 Q4, (e) LLM judge cache reused at evals/audit/llm_judge_cache/__.json per A-007 + A-014, (f) results persisted to evals/audit/reference_scorer_rater_audit.json with one top-level key per reference rung plus per-rater rubric records plus inter-rater Cohen’s kappa on rater_judgment_correct plus weighted-kappa on calibration_assessment plus per-(slice, contamination_state) breakdown plus headline percent-judged-correct number; Makefile target audit-reference-scorers at Phase 4 Commit 6 landing wraps the script with the cost-cap interactive-approval gate; cost envelope tracked in evals/cost_ledger.csv per ADR-020 ledger discipline plus integration with cost-rollup-check; transcript checkpoint at transcripts/2026-05-16__phase-4-entry-plus-phase-1-library-first-refactor.md captures the Q4 in-depth walkthrough where this decision was ratified.
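The disagreement filter, the per-stratum sampling, and the unweighted Cohen’s kappa named in items (b) and (f) can be sketched in plain Python. This is a minimal illustration, not the body of scripts/audit_reference_scorers.py: the row keys (reference_proba, lora_proba, slice, contamination_state) follow the wording of the criterion, while the function names, the list-of-dicts input, and the seed handling are assumptions.

```python
import random
from collections import defaultdict


def disagrees(reference_proba, lora_proba, threshold=0.5):
    """True when the two scorers land on opposite sides of the threshold."""
    return (reference_proba > threshold) != (lora_proba > threshold)


def stratified_sample(rows, per_stratum=3, seed=0):
    """Sample up to per_stratum disagreeing rows per (slice, contamination_state)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        if disagrees(row["reference_proba"], row["lora_proba"]):
            strata[(row["slice"], row["contamination_state"])].append(row)
    sample = []
    for key in sorted(strata):  # sorted keys keep the draw order reproducible
        members = strata[key]
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample


def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1.0 - expected)
```

The weighted kappa on the ordinal calibration_assessment axis would need a weight matrix over the ordinal levels and is left out of this sketch.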

CLAIM-049 - ADR-049: GPU failover ladder priority refresh — A100-SXM4-80GB to position 1 + US-WA-1/US-CA-2 datacenter additions + EU-RO-1 drop (frozen-probe + LoRA canonical; full-FT deferred to post-rehearsal)

Status: Accepted

Source: Phase 4 Commit 6 close + canonical-run plan walkthrough — /AskUserQuestion 3-question session 2026-05-17 (full-FT plan + ADR-049 scope + commit batching); user-locked Defer-to-post-rehearsal + Narrow-supersession + All-3-commits-now; live runpod-deploy gpu-list + validate output across 6 RunPod datacenters captured as the rationale for the position refresh

Closing commit/ADR: 423c2c8

Claim

ADR-020 locks the GPU failover ladder set + tier structure + dual-layer cost cap discipline + BATCH_TABLE invariant. ADR-049 supersedes ONLY the gpu_order POSITION within the ladder (the SET is intact) plus the pod.datacenters list — both fall outside ADR-020’s locked-by-ADR layer (the ladder set is locked; the order within a tier is a config detail). Live RunPod stock observation on 2026-05-17 00:30 UTC at Phase 4 close surfaced the operational reality that the configured priority (H100/H200 at positions 1-6) is dry across every datacenter probed — US-MD-1 had no H100/H200 stock (H200 NVL present but stock empty), EU-RO-1 had no Hopper SKUs in DC at all plus A100-80G stock empty, US-CA-2 had Hopper SKUs present but stock empty, CA-MTL-1 had H100 80GB HBM3 present but stock empty, EU-NL-1 had H100 80GB HBM3 present but stock empty. Only A100-SXM4-80GB at Low stock in US-MD-1 plus US-WA-1 is a provisionable target. Refreshed pod.gpu_order per ADR-049 reorders the existing ladder set to (1) NVIDIA A100-SXM4-80GB, (2) NVIDIA A100 80GB PCIe (T2 retained), (3-6) NVIDIA H100 80GB HBM3 plus H100 NVL plus H100 SXM plus H100 PCIe (T1 retained as later fallback in case mid-Phase-2 stock recovers), (7) NVIDIA H200, (8) NVIDIA H200 NVL, (9) NVIDIA L40S (T3 retained per ADR-020 flash-attn-fallback recipe), (10) NVIDIA A100-SXM4-40GB (T4 emergency retained). Refreshed pod.datacenters per ADR-049 reorders to [US-MD-1, US-WA-1, US-CA-2] — US-WA-1 added as second A100-SXM4-80GB candidate (currently Low stock; doubles provisioning success probability); US-CA-2 added as nominal fallback (A100-SXM4-80GB present but stock empty; recovery candidate); EU-RO-1 dropped (no A100-80G stock plus no Hopper SKUs in DC; dead weight in the failover chain). 
Per-rung scope — frozen-probe plus LoRA canonical runs fire under the refreshed configs (cap $40 + $60 per ADR-020 retained); full-FT canonical run DEFERRED to post-rehearsal per user direction 2026-05-17 — fires after v0.9.0-rc1 dress-rehearsal completes cleanly per ADR-033 + ADR-046 Q7, gated by a follow-up ADR superseding the defer status when fired. Rationale for defer — full-FT on A100-SXM4-80GB at 12 cells times approximately 30-60 min per cell equals 6-12 hours wall-time on a Low-stock GPU class is operationally fragile (mid-run preemption plus race-with-other-customers risk grows with wall-time); frozen-probe plus LoRA together fit approximately 4-8 hours wall plus approximately $13-25 actual cost (well under $100 combined cap); validates the pipeline end-to-end without exposing the longest-running rung to the highest-risk stock conditions. Operator follow-up after this ADR lands — refresh configs/runpod/headline-{frozen_probe, lora, full_ft}.yaml with the new gpu_order + datacenters (same change to all 3 so full-FT is ready when stock improves); make headline-dry-run should now validate cleanly with no all-configured-GPUs-Low-stock warning; operator fires make headline-frozen-probe then make headline-lora sequentially with interactive approval gates per ADR-020.
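The narrow-supersession claim (same ladder set, different order) is mechanically checkable. The sketch below assumes the GPU names as written in the claim text; the exact identifier strings expected by runpod-deploy, and the helper name is_reorder_only, are assumptions for illustration.

```python
# GPU names copied from the claim text; the exact identifier strings that
# runpod-deploy expects are an assumption.
REFRESHED_GPU_ORDER = [
    "NVIDIA A100-SXM4-80GB",   # position 1 (T2, promoted)
    "NVIDIA A100 80GB PCIe",   # position 2 (T2 retained)
    "NVIDIA H100 80GB HBM3",   # positions 3-6 (T1 retained as later fallback)
    "NVIDIA H100 NVL",
    "NVIDIA H100 SXM",
    "NVIDIA H100 PCIe",
    "NVIDIA H200",             # position 7
    "NVIDIA H200 NVL",         # position 8
    "NVIDIA L40S",             # position 9 (T3 retained)
    "NVIDIA A100-SXM4-40GB",   # position 10 (T4 emergency retained)
]
REFRESHED_DATACENTERS = ["US-MD-1", "US-WA-1", "US-CA-2"]


def is_reorder_only(locked_set, new_order):
    """True when new_order lists every locked GPU exactly once, in any order."""
    return len(new_order) == len(set(new_order)) and set(new_order) == set(locked_set)
```

A check of this shape passes for any permutation of the locked set and fails as soon as a GPU is added, dropped, or duplicated, which is the boundary between a config-level reorder and a change to ADR-020's locked layer.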

Acceptance criterion

decisions/ADR-049-gpu-order-priority-refresh-a100-80g.md exists at this path with Accepted status; SUBMISSION_AUDIT.md regenerates via scripts/regenerate_audit.py with ADR-049 included and ADR-020 marked superseded-in-part-by-049 on the gpu_order priority axis only; configs/runpod/headline-frozen_probe.yaml + configs/runpod/headline-lora.yaml + configs/runpod/headline-full_ft.yaml all refresh pod.gpu_order to (A100-SXM4-80GB first; A100 80GB PCIe second; H100/H200 family positions 3-8; L40S position 9; A100-SXM4-40GB position 10) and pod.datacenters to [US-MD-1, US-WA-1, US-CA-2]; runpod-deploy validate --config configs/runpod/headline-frozen_probe.yaml --all passes without the “all configured GPUs are Low stock” warning that fired pre-ADR-049 (validator reports A100-SXM4-80GB at Low stock as the matched candidate instead); docs/ROADMAP.md Phase 4 close note rehearsal-tag dispatch checklist amended to add a Step 7 explicitly noting that full-FT canonical run fires post-rehearsal only and requires a follow-up ADR superseding ADR-049’s defer status when fired; decisions/library_imports.md runpod-deploy section unchanged (the primitives invoked are identical; only the config payload values change); CLAUDE.md anti-pattern “Mutating a locked decision without writing a superseding ADR” satisfied — this ADR is the explicit supersession trail for the position-order half of ADR-020 GPU failover ladder; the SET + TIER STRUCTURE + dual-layer cost cap discipline + BATCH_TABLE invariant from ADR-020 are unchanged + remain locked.

CLAIM-050 - ADR-050: Rung slate narrowing — LLM judges dropped (cost) + full-FT OOD dropped (FUSE EIO crash); narrow supersession of ADR-018 reference slate + ADR-021 OOD comparison scope

Status: Accepted

Source: Phase 4-5 canonical recovery — /exploring-options Round 2 Q1 (LLM-judge cost) + Round 4 (FUSE crash recovery) + Round-by-round /exploring-options walkthrough 2026-05-17 through 2026-05-18.

Closing commit/ADR: 3b16036

Claim

Phase 4-5 canonical recovery surfaced two rung-slate revisions that materially change the reference-comparison narrative and require explicit ADR documentation per the project anti-pattern “Mutating a locked decision without writing a superseding ADR”. Revision 1 (cost — LLM judges dropped) — ADR-018 locked the four-rung reference slate as gpt-4o-2024-08-06 + claude-sonnet-4-6 + ProtectAI v1 + ProtectAI v2 with an estimated cost envelope of approximately 14 USD across both LLM judges plus the LLM-rater audit; Phase 4 cost re-estimation against the actual OOD slate sizing (~24k val rows + ~13k pooled OOD slices for both detection comparison + rater audit) revealed an actual envelope closer to 240 USD (16x the original estimate) — driven by (a) per-row LLM-judge inference being charged at the full input-prompt token count whereas the original estimate assumed shorter-prompt heuristics, plus (b) the rater-audit disagreement-sampled cohort scaling with total prediction volume rather than a fixed ~50-pair cohort. User-locked decision (Round 2 Q1, 2026-05-17 ultra-think risk surfacing) drop LLM judges entirely from the reference comparison. The TF-IDF+LR classical floor (verified_disjoint) + ProtectAI v1 + v2 (suspected_contamination) remain as the reference rungs; the vendor_black_box tier is empty. ADR-018 contamination-stratification narrative still holds for the remaining rungs; the four-tier disclosure gradient compresses to three tiers (verified_disjoint, backbone-partial-disjoint, suspected_contamination). Revision 2 (operational — full-FT OOD dropped) — ADR-019 + ADR-021 locked the trained-rung slate as classical-floor + frozen-probe + LoRA + full-FT with OOD scoring on all 3 transformer rungs across the 5-slice OOD slate. 
Phase 5 X11 full-FT re-fire (configured per the runpod-rsync-everything-before-delete memory rule to persist checkpoints via cleanup_intermediate_checkpoints set to false) crashed mid-training on a FUSE EIO during shutil.copytree of the 598 MB optimizer.pt to /workspace MooseFS-backed storage; the trainer crashed before any cell final checkpoint persisted to durable storage. The pre-crash full-FT LODO predictions (24 parquets from Phase 2 original orchestrator-fired full-FT run) survived; full-FT OOD inference is methodologically impossible without re-firing a 6-12 hour A100 80GB pod plus a likely repeat of the FUSE crash. User-locked decision (Round 4, 2026-05-18 FUSE-crash recovery) abandon full-FT OOD; ship 2-rung OOD via the alive pod (frozen-probe + LoRA). full-FT remains in the LODO comparison (3-rung ladder narrative — frozen-probe to LoRA to full-FT — holds on the LODO held-out attack-source generalization test); OOD comparison drops to 2 trained rungs + 1 classical floor + 2 reference scorers (ProtectAI v1 + v2) = 5 rungs total. Consequences for the WRITEUP narrative — the methodology spoke gains a Limitations subsection naming both drops with the operational rationale (LLM-judge cost overrun + FUSE EIO crash); the headline AUPRC-vs-rung chart on OOD shows 5 rungs not 6; the LODO chart shows the full 3-rung trained ladder. Per-axis matched-budget framing from ADR-018 is unchanged. ADR-005 contamination taxonomy retains all 4 tier labels but the vendor_black_box tier carries 0 rungs in this submission.

Acceptance criterion

decisions/ADR-050-rung-slate-narrowing-llm-judges-and-full-ft-ood-dropped.md exists at this path with Accepted status; SUBMISSION_AUDIT.md regenerates via scripts/regenerate_audit.py with ADR-050 included and ADR-018 marked superseded-in-part-by-050 on the reference-rung enumeration axis (specifically the gpt-4o + claude-sonnet-4-6 entries) and ADR-021 marked superseded-in-part-by-050 on the trained-rung OOD comparison scope (specifically the full-FT OOD entries); evals/predictions/ contains 0 LLM-judge predictions + 0 full-FT OOD predictions; SPEC_SHEET §4 reference-rung enumeration is annotated to indicate that LLM-judge rungs were dropped post-lock per ADR-050 with a one-line cost rationale; the WRITEUP methodology spoke gains a Limitations subsection covering both rung drops with cost-overrun rationale for LLM judges and FUSE-EIO operational rationale for full-FT OOD; the OOD analysis chart legend shows 5 rungs (not 6); the LODO analysis chart legend shows 3 trained rungs (frozen-probe + LoRA + full-FT) where full-FT remains in scope per the surviving Phase 2 LODO predictions; A-006 contamination-caveat assumption is unchanged (the three remaining contamination tiers are still active); cost_ledger.csv shows 0 USD LLM-judge entries and the full-FT re-fire entry carries the failed status + manual_recovery true + the FUSE EIO crash notes.

CLAIM-051 - ADR-051: v1.0.x carryforward of ADR-034 T0 score-match wiring + ADR-039 gate-3 invariant-scaffold unskip; narrow supersession with explicit v1.1.x landing condition

Status: Accepted

Source: REPO_AUDIT_2026-05-18.md §P0 “HF Hub / T0 reproducibility is unfinished” + §P1 “Tests pass but invariant readiness is not real” + post-v1.0.1 audit re-examination (this session, 2026-05-18). The audit said explicitly: “Either finish this path or write a superseding ADR that explicitly waives it for submission” (T0 context) and “Passing pytest does not prove submission readiness under the repo’s own contract” (invariant context). v1.0.0 + v1.0.1 did neither; this ADR closes the gap via the supersession path explicitly invited by the audit.

Closing commit/ADR: v1.0.9 (Block A; Block B carryforward to v1.1.x)

Claim

REPO_AUDIT_2026-05-18 flagged two governance gaps at submission time that the v1.0.0 + v1.0.1 patches did not close: (a) the T0 score-match script body in scripts/eval_from_hub.py (per ADR-034) remains a scaffold despite HF Hub publication of BBehring/prompt-injection-{frozen-probe,lora} landing at v1.0.1; (b) 38 invariant-test stubs in tests/test_invariants.py remain skip-marked despite ADR-039 gate 3 explicitly requiring them unskipped + green at the submission tag. Per the project anti-pattern “Mutating a locked decision without writing a superseding ADR”, these unmet contracts require explicit documentation. This ADR narrowly supersedes both — preserving the rest of ADR-034 (T1 + T3 tiers unchanged) and the rest of ADR-039 (gates 1, 2, 4, 5, 6 unchanged) — and codifies the carryforward to v1.1.x with explicit landing conditions. Block A — ADR-034 T0 score-match wiring carryforward. ADR-034 defines T0 reproducibility as “external party reproduces our scores from our predictions” via huggingface_hub.snapshot_download(repo_id) → load via AutoModelForSequenceClassification.from_pretrained(local_path) → CPU inference against the local val slate → score-match against evals/results.json within 1e-4 absolute tolerance. At v1.0.1 the publish half lands cleanly (both rungs live on HF Hub; auto-generated model cards visible; reviewer URLs 200). The script body remains a scaffold that prints a “non-dry-run body not yet implemented; see WRITEUP/reproducibility.md T0 maintainer note + ADR-051” message and exits 2. The scaffold is ~100 LOC away from full wiring; the deferral is operational (the v1.0.x release cadence is submission-readiness, not methodology-completeness), not methodological. WRITEUP/reproducibility.md documents this explicitly. v1.1.x landing condition — make eval-from-hub RUNG=frozen-probe + RUNG=lora exit 0 with score-match summary within 1e-4 tolerance per ADR-034 spec. Block B — ADR-039 gate 3 invariant-scaffold unskip carryforward. 
ADR-039 gate 3 says “All tests/test_invariants.py stubs unskipped + green; every @pytest.mark.skip decorator removed; pytest -m unit exits clean. Verified via pytest -m unit tests/test_invariants.py + pytest --collect-only shows zero skipped tests.” At v1.0.1 the test file ships 10 implemented invariants (data + dedup + leakage + scorer-schema + calibration-battery; all pass; sourced from Phase 1-4 artifacts) + 38 scaffold stubs that retain @pytest.mark.skip(reason="v1.0.0 carryforward stub — see module docstring; deferred to v1.1.x") markers. The module docstring catalogues all 38 stubs into 3 buckets: spec-invariant scaffolds (covered-by-code-construction; the executable assertion is the better-discipline form), reporting invariants (covered by pre-commit hook or manual review at v1.0.x; executable test is the better form), and ADR-050-orphaned invariants (full-FT OOD + LLM judges no longer in slate; the invariants describe targets that don’t exist). The carryforward rationale is that ADR-039 gate 3 as written is too strict for a Phase-0-locked submission ladder — Phase 0 cannot fully predict which invariants Phase 1-5 will materialise; the v1.0.x tag scopes the spec discipline (every locked decision is an ADR; every assumption is in the registry) and defers the full executable-invariant-suite to v1.1.x. v1.1.x landing condition — pytest -m unit tests/test_invariants.py returns 48 passed / 0 skipped (or N skipped each with an explicit ADR-numbered exemption reason that survives audit). Combined consequences. SUBMISSION_AUDIT.md gains an entry for ADR-051 noting the narrow supersession scope. The reviewer URL at v1.0.0/v1.0.1/v1.0.2 stays valid; live Quarto site reflects the latest tag with the ADR-051 supersession surfaced in the decisions/ index. WRITEUP/reproducibility.md T0 maintainer note already references the forthcoming ADR-051; that note becomes accurate-by-construction once this ADR lands. 
No code or methodology changes ship with ADR-051 — it is governance only. No other ADR is affected.

Acceptance criterion

decisions/ADR-051-v1.0.x-carryforward-of-t0-and-invariant-scaffolds.md exists at this path with Accepted status; SUBMISSION_AUDIT.md regenerates via scripts/regenerate_audit.py with ADR-051 included; ADR-034 and ADR-039 frontmatter both gain a superseded_by entry referencing ADR-051 (narrow supersession on the two specific axes; T1 + T3 + gates 1+2+4+5+6 unchanged); decisions/README.md ADR index includes ADR-051; WRITEUP/reproducibility.md T0 maintainer note cross-references this ADR (already drafted at v1.0.1 with the anticipatory reference); CHANGELOG [1.0.2] entry summarises the supersession; v1.0.2 tag + GH release page resolve at HTTP 200.

CLAIM-052 - ADR-052: Full-FT OOD inference drop — methodological reframing; narrow supersession of ADR-050 Revision 2 (FUSE-crash-only framing)

Status: Accepted

Source: Post-v1.0.2 review of REPO_AUDIT_2026-05-18 + user’s cover-letter draft language (draft.md) that framed the full-FT decision as methodological judgment rather than operational forced-drop. The cover-letter version is the more honest read of the decision-not-to-push-through; ADR-050 R2’s operational-only framing under-states the methodological reasoning that decided the post-crash choice. AskUserQuestion 4-Q batch 2026-05-18 #N+2 Q1 user-locked the methodological-load-bearing-with-crash-as-trigger framing.

Closing commit/ADR: v1.0.3

Claim

ADR-050 Revision 2 attributed the Phase 5 full-FT OOD inference drop solely to an X11 FUSE EIO crash on /workspace MooseFS storage (operational forced-drop framing). This ADR narrowly supersedes Revision 2 to reframe the drop as methodologically load-bearing + operationally triggered: the rung-ladder + paired-bootstrap CI inspection at LoRA results had already established that further fine-tune investment on the v1.0.x training scope was unlikely to pay off; the FUSE crash was the proximate trigger that exposed a decision the methodology already pointed toward. Revision 1 (LLM-judge cost drop) is unchanged. The vendor_black_box contamination tier still carries 0 rungs in this submission. ADR-050 Revision 2’s operational facts remain accurate; the load-bearing reason is restated. Methodological reasoning that decided the post-crash choice. Phase 4 paired-bootstrap analysis showed LoRA’s pooled_ood AUPRC delta vs frozen-probe at -0.071 (CI clears zero; paired_cells.parquet seed=1 + seed=2 stability check 0/40 cells flagged) — fine-tuning the head onto the LODO direct-injection training pool was actively HURTING OOD generalization relative to leaving the pretrained ModernBERT-base embeddings intact. Full-FT (full backbone trainable; ~149M parameters trainable vs LoRA’s ~1.5M) was a larger version of the same fine-tuning mechanism that LoRA had just shown to be net-harmful on OOD. Expected marginal benefit of full-FT-OOD over LoRA-OOD on the same training pool: low. Cost of re-firing full-FT OOD inference on a Low-stock A100 80GB: ~6-12 hours wall + repeat-FUSE-risk + ~$5-12 GPU spend + operator approval gates per ADR-020. Cost-benefit lands on drop. Retrospective self-awareness on full-FT LODO investment. 
With the data-set sizes used (~4.7K positives + ~17K benigns post-dedup, no augmentation) and the paired-bootstrap evidence that LoRA → frozen-probe delta is negative on OOD, the rung-ladder + CI inspection now suggests the full-FT LODO investment itself was likely not load-bearing for the characterisation conclusions of this submission. A v1.1.x iteration with a larger training pool (e.g., augmentation; cross-source injection-style addition) might revisit full-FT as the rung where the additional parameter budget starts to pay; the v1.0.x scope locks full-FT in LODO (per the original ADR-019 commitment that produced the 24 Phase 2 LODO predictions) and drops full-FT from OOD. The FUSE crash is the proximate operational event that exposed the decision-not-to-push-through; the methodological reasoning above is the load-bearing justification. Consequences. Reviewer-facing surface: WRITEUP §8.1 full-FT bullet rewritten to lead with the methodological reasoning + acknowledge the FUSE crash as trigger. WRITEUP/model-rungs.md §4.3 full-FT Note paragraph re-anchored to the same framing. README does not surface ADR-052 directly (the headline finding is unchanged: LoRA + frozen-probe + 2 ProtectAI = 5-rung OOD slate). Governance-trail: ADR-050 frontmatter gains superseded_by: [ADR-052] on the Revision 2 axis only; Revision 1 (LLM-judge cost drop) is untouched.
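The paired-bootstrap comparison that the reasoning above leans on can be sketched as follows. This is a simplified stand-in, assuming step-wise average precision as the AUPRC estimator and a percentile interval; the function names, resample count, and tie handling are illustrative and are not taken from the project's bootstrap code.

```python
import random


def average_precision(y_true, y_score):
    """Step-wise average precision; ties are broken by input order (a simplification)."""
    n_pos = sum(1 for label in y_true if label)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positives")
    ranked = sorted(zip(y_score, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / n_pos


def paired_bootstrap_delta(y_true, score_a, score_b, n_boot=1000, alpha=0.05, seed=0):
    """Point estimate and percentile CI for AP(score_a) - AP(score_b).

    Each resample draws row indices once and scores both models on the same
    rows, which is what makes the comparison paired.
    """
    if not any(y_true):
        raise ValueError("need at least one positive row")
    rng = random.Random(seed)
    n = len(y_true)
    deltas = []
    while len(deltas) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        labels = [y_true[i] for i in idx]
        if not any(labels):
            continue  # skip degenerate resamples with no positives
        ap_a = average_precision(labels, [score_a[i] for i in idx])
        ap_b = average_precision(labels, [score_b[i] for i in idx])
        deltas.append(ap_a - ap_b)
    deltas.sort()
    lo = deltas[int((alpha / 2) * n_boot)]
    hi = deltas[min(n_boot - 1, round((1 - alpha / 2) * n_boot) - 1)]
    point = average_precision(y_true, score_a) - average_precision(y_true, score_b)
    return point, lo, hi
```

Under this sketch, a negative delta whose interval excludes zero has both lo and hi below zero.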

Acceptance criterion

decisions/ADR-052-…md exists at this path with Accepted status; decisions/ADR-050-…md frontmatter gains superseded_by: [ADR-052] (narrow — Revision 2 axis only); WRITEUP/limitations-and-future-work.md §8.1 full-FT bullet rewritten to lead with methodological reasoning + acknowledge FUSE crash as operational trigger; WRITEUP/model-rungs.md §4.3 full-FT Note paragraph re-anchored to the same framing; SUBMISSION_AUDIT.md regenerates via scripts/regenerate_audit.py with ADR-052 included; CHANGELOG [1.0.3] entry summarises the supersession.

CLAIM-053 - ADR-053: Reading-guide governance + newcomer onboarding paths — two entry artifacts + 3 reading paths + Headline-finding-block + interpretation pedagogy + pointer convention

Status: Accepted

Source: transcripts/2026-05-18__phase-12-04-reading-guide-governance.md (private; emailed to reviewer separately at submission per ADR-029)

Closing commit/ADR: v1.0.4

Claim

The Quarto site has two entry artifacts with distinct roles: EXECUTIVE_SUMMARY.md (1-page decision-maker layer; added at v1.0.3 per NEXT_STEPS §1.7) and index.qmd (reviewer-landing reading guide). Until v1.0.4 neither was anchored in an ADR — ADR-030 locks Quarto HTML as the deliverable but not reading-flow architecture; ADR-033 locks release-strategy URLs but not landing-page content. v1.0.4’s stale-content audit surfaced that index.qmd had drifted into Phase-0-time scaffolding language: stale ADR count (34+), stale make commands, stale Status section (“spokes are skeletons; Phase 5 populates them”), silent on the actual headline finding. This ADR retroactively + prospectively anchors the reading-guide architecture in 5 governance dimensions so the drift cannot recur without an explicit superseding ADR. Decision (5 governance dimensions): (1) Two entry artifacts with distinct roles — EXECUTIVE_SUMMARY for decision-makers + time-constrained reviewers; index.qmd for reviewers clicking into the live site root. (2) Three reading paths canonical — A1 Quick-skim (~15 min) + A2 Audit (~60 min) + A3 Reproduce (~30 min CPU). (3) Headline-finding-block-on-index is required — index.qmd must state the headline numbers up-front, not bury them behind “see WRITEUP §5” pointers. (4) Interpretation pedagogy on index.qmd is required — reviewers shouldn’t have to assemble the framing (prevalence baseline, cross-family OOD, negative-delta meaning, non-monotone versioning, threshold transfer) from spokes. (5) Pointer convention — index.qmd → EXECUTIVE_SUMMARY → WRITEUP → spokes → ADRs. Retroactive ADR coverage: EXECUTIVE_SUMMARY.md was added at v1.0.3 per NEXT_STEPS §1.7 alone (no ADR) — ADR-053 covers its role retroactively. NEXT_STEPS.md §1.7 gets a back-reference to ADR-053 in the v1.0.4 patch. Driver: user question “does the reading guide clearly say what the final results were? is it organized in a way that makes sense to someone coming to the project. 
Does it conform to our initial guidance and/or does our ADRs need to be enriched?” (2026-05-18). The honest answer at v1.0.3 was NO (silent on results) + YES (ADRs need enrichment). This ADR is that enrichment + the v1.0.4 index.qmd rewrite is the implementation.

Acceptance criterion

At v1.0.4 close, index.qmd contains: (a) a “Results” section with the 3-row pooled_ood AUPRC trio + CI bounds sourced from evals/bootstrap/marginal_cells.parquet; (b) a “How to read these numbers” section with 5 interpretation patterns (prevalence baseline, cross-family vs cross-source, negative LoRA delta, ProtectAI non-monotone, val→LODO threshold transfer); (c) all 3 reading paths labeled A1/A2/A3; (d) a “Headline ADRs to read” curated sub-list in the A2 audit path; (e) Status section anchored in v1.0.4 reality (not Phase-0-time scaffolding); (f) pointer at EXECUTIVE_SUMMARY as A1 step 1. EXECUTIVE_SUMMARY.md keeps its 1-page scope + retroactive ADR-053 coverage referenced in NEXT_STEPS §1.7. Both files render via _quarto.yml render allowlist + appear in the sidebar.

CLAIM-054 - ADR-054: RESULTS.md as third entry artifact — narrow supersession of ADR-053 dimension 1 (two-entry-artifacts); dimensions 2-5 unchanged

Status: Accepted

Source: transcripts/2026-05-18__phase-12-05-results-page-and-badges.md (private; emailed to reviewer separately at submission per ADR-029)

Closing commit/ADR: v1.0.5

Claim

ADR-053 (landed at v1.0.4 — one commit before v1.0.5) locked the Quarto reading-guide architecture in 5 governance dimensions, the first of which was “two entry artifacts: EXECUTIVE_SUMMARY.md (1-page decision-maker layer) + index.qmd (reviewer-landing reading guide)”. Post-v1.0.4, user feedback identified that the actual model-run results were either missing from the Quarto site or hard to find: the 5-rung × 5-slice AUPRC + AUROC + recall@FPR1% grid was never tabulated on the rendered site (only the 3-row pooled_ood trio appears on index.qmd); the 7 Phase 4 figures (docs/plots/F*.svg) were never embedded in any rendered page; and the raw-data parquets in evals/ were reachable only via repo clone or manual GitHub navigation. The reading-guide architecture had an artifact-discovery gap. Decision: add a third entry artifact RESULTS.md with a distinct role from the existing two — data-disclosure / artifact-discovery (vs EXECUTIVE_SUMMARY = thesis-distillation and index.qmd = reviewer-orientation). Narrowly supersede ADR-053 dimension 1 only (“two entry artifacts” becomes “three entry artifacts”); dimensions 2-5 (3-path canonical order + Headline-finding-block requirement + interpretation pedagogy requirement on index.qmd + pointer convention) are unchanged. RESULTS.md scope: (1) full 5×5 AUPRC grid with N/A markers in single-class cells per ADR-050; (2) AUROC cross-paper diagnostic at same shape; (3) recall@FPR1% policy-relevant grid; (4) embedded docs/plots/F1-F7.svg figures with provenance; (5) raw-data table with GitHub blob URLs at tree/v1.0.5 for every parquet + JSON in evals/; (6) reproducibility commands mirroring the index.qmd T0/T1/T3 tier table (DRY). Sidebar placement under the “Reading guide” section as the third entry, after EXECUTIVE_SUMMARY + index.qmd, so the cold reviewer arc (thesis → orientation → results) is one sidebar click each. 
Pointer convention extension: index.qmd Results section + EXECUTIVE_SUMMARY reading-path + WRITEUP §Results all cross-reference RESULTS.md as the canonical artifact-disclosure page. Interpretation pedagogy stays on index.qmd (ADR-053 dimension 4) and is NOT duplicated on RESULTS.md — RESULTS is for the reader who already knows they want the numbers; index.qmd is for the reader who needs the framing.

Acceptance criterion

At v1.0.5 close, RESULTS.md exists at repo root + renders to _site/RESULTS.html + appears in the Quarto sidebar under “Reading guide” as the third entry after EXECUTIVE_SUMMARY + index.qmd. It contains: (a) §1 full 5-rung × 5-slice AUPRC grid with N/A markers; (b) §2 5×5 AUROC diagnostic; (c) §3 5×5 recall@FPR1% policy grid; (d) §4 7 embedded figures (F1-F7) with provenance footers; (e) §5 raw-data table linking every artifact in evals/ to GitHub at tree/v1.0.5; (f) §6 reproducibility tier mirror. ADR-053 frontmatter is edited in-place to add superseded_by: [ADR-054] per the established convention (ADR-050 had its frontmatter edited when ADR-052 narrowly superseded R2; same pattern here). index.qmd + EXECUTIVE_SUMMARY + WRITEUP §Results all carry pointers to RESULTS.md.

CLAIM-055 - ADR-055: eval-toolkit installs from PyPI (not git+https) at v0.40.0+ — narrow supersession of ADR-036 “git URL is the only viable spec format”

Status: Accepted

Source: transcripts/2026-05-19__phase-13-pypi-install-and-platt-beta.md (private; emailed at submission)

Closing commit/ADR: v1.0.8

Claim

ADR-036 §Context claims “All three are own-authored on GitHub (not on PyPI) — git URL is the only viable spec format.” This premise is false for eval-toolkit as of v0.40.0 (released 2026-05-18 via Trusted Publishing OIDC at https://pypi.org/project/eval-toolkit/). v1.0.8 switches the eval-toolkit pin from git+https://github.com/brandon-behring/eval-toolkit@v0.39.0 to the PyPI install spec eval-toolkit==0.40.0 (PEP 508 version specifier). uv.lock continues to provide byte-level reproducibility via SHA-256 of the PyPI wheel artifact. runpod-deploy + research_toolkit retain git+https tag-pin per ADR-036 (runpod-deploy v1.1.0 may also switch per separate v1.1.0 governance; research_toolkit not yet on PyPI). The remainder of ADR-036 — tag-pin convention, freeze-for-submission-window discipline, 3 bump triggers (blocking-bug + critical-security + reviewer-feedback-patch), uv.lock byte-level backstop — is unchanged.

Acceptance criterion

At v1.0.8 close, pyproject.toml specifies eval-toolkit==0.40.0 (PyPI install) not eval-toolkit @ git+https://...@v0.39.0. uv pip show eval-toolkit reports Version: 0.40.0 + Location: .venv/lib/python3.13/site-packages/ (not git-clone source). uv.lock source line shows registry = "https://pypi.org/simple" not git = "https://github.com/...". 171+/171+ smoke tests pass post-bump. ADR-036 frontmatter superseded_by updated to ["055"] in-place per ADR-029 convention.
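The two requirement forms contrasted in this criterion differ in PEP 508 terms: a direct reference (name @ url) versus a name with a version specifier. A rough classifier is sketched below; classify_requirement is a hypothetical helper and deliberately not a full PEP 508 parser (it ignores extras and environment markers).

```python
def classify_requirement(spec):
    """Classify a dependency string as 'direct-url', 'exact-pin', or 'other'.

    Rough heuristic only: a PEP 508 direct reference has the shape
    'name @ url', while a registry pin has the shape 'name==version'.
    """
    name, sep, rest = spec.partition("@")
    if sep and "://" in rest:
        return "direct-url"
    if "==" in spec:
        return "exact-pin"
    return "other"


# The two forms named in the criterion.
OLD_SPEC = "eval-toolkit @ git+https://github.com/brandon-behring/eval-toolkit@v0.39.0"
NEW_SPEC = "eval-toolkit==0.40.0"
```

For real validation, a proper requirement parser is the safer choice; the heuristic here only illustrates the distinction the criterion tests for.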

CLAIM-056 - ADR-056: Calibration battery refactor to eval-toolkit _binary API + Platt + Beta calibrators landed — narrow supersession of ADR-023 “temperature + isotonic only” scope deferral

Status: Accepted

Source: transcripts/2026-05-19__phase-13-pypi-install-and-platt-beta.md (private; emailed at submission)

Closing commit/ADR: v1.0.8

Claim

ADR-023 §Decision deferred Platt + Beta calibrators as out-of-scope (“temperature + isotonic suffice for the audit”) because eval-toolkit didn’t ship them in scalar-prob form at v0.31.0 lock time. v0.40.0 shipped fit_platt_binary + fit_beta_binary (eval-toolkit#43; filed v1.0.6; closed v1.0.8 within ~17 min of filing). v1.0.8 lands the full 4-calibrator binary battery (temperature + isotonic + Platt + Beta) uniformly on the eval-toolkit _binary API family with the canonical (params_tuple, apply) return shape. The src/eval/calibration_battery.py refactor: (a) replaces fit_temperature(val_logprobs, y_val) (multi-class log-prob API; what we missed at earlier pin bumps) with fit_temperature_binary(y_true, y_score) (v0.35.0; scalar-prob sibling); (b) adds fit_platt_binary + fit_beta_binary (v0.40.0); (c) wraps fit_isotonic_calibrator in local adapter fit_isotonic_binary_local returning (None, apply) for shape consistency until eval-toolkit#44 ships native fit_isotonic_binary; (d) deletes hand-rolled proba_to_logprobs + apply_temperature helpers (now duplicated by upstream’s internal apply callable); (e) extends CalibratorBundle NamedTuple with platt_params, test_scores_platt, beta_params, test_scores_beta fields.
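The (params_tuple, apply) return shape described above can be illustrated with a small Platt-style calibrator. This is a sketch under stated assumptions, not eval-toolkit's implementation: fit_platt_sketch is a hypothetical name, the fit uses plain gradient descent on mean log loss, and it operates on raw scores rather than whatever transform the upstream fit_platt_binary applies.

```python
import math


def _sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_platt_sketch(y_true, y_score, lr=0.5, n_iter=2000):
    """Fit p = sigmoid(a * s + b) by gradient descent on mean log loss.

    Returns (params_tuple, apply), where apply maps new scores to
    calibrated probabilities using the fitted (a, b).
    """
    a, b = 1.0, 0.0
    n = len(y_true)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for y, s in zip(y_true, y_score):
            err = _sigmoid(a * s + b) - y  # derivative of log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n

    def apply(scores):
        return [_sigmoid(a * s + b) for s in scores]

    return (a, b), apply
```

Returning the parameters alongside a closure is what lets a parameter-free calibrator such as isotonic regression share the same shape by returning (None, apply), as the local adapter in item (c) does.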

Acceptance criterion

At v1.0.8 close, src/eval/calibration_battery.py imports fit_temperature_binary + fit_platt_binary + fit_beta_binary from eval_toolkit.calibration; fit_temperature (multi-class) not imported; proba_to_logprobs + apply_temperature deleted from the module. CalibratorBundle NamedTuple has 7 fields (temperature_T + test_scores_temperature + test_scores_isotonic + platt_params + test_scores_platt + beta_params + test_scores_beta). 7/7 smoke tests in tests/smoke/test_calibration_battery_smoke.py pass (proba_to_logprobs + apply_temperature tests removed; new test_fit_and_apply_calibrators_returns_bundle_with_4_calibrators added). Local fit_isotonic_binary_local adapter present + removal trigger documented (when eval-toolkit#44 ships). 167/167 broader smoke suite green (171 - 4 deleted = 167). ADR-023 frontmatter superseded_by updated to ["056"] in-place per ADR-029 convention.

CLAIM-057 - ADR-057: Per-prediction provenance manifests via scripts/backfill_provenance.py — schema, location, naming, idempotency

Status: Accepted

Source: transcripts/2026-05-19__phase-13-pypi-install-and-platt-beta.md (private; emailed at submission)

Closing commit/ADR: v1.0.8

Claim

v1.0.8 closes NEXT_STEPS §1.9 (Manifest backfill pipeline) by adding scripts/backfill_provenance.py + emitting per-prediction provenance JSON at evals/manifests/<rung>__<fold>__<seed>__<slice>.json for each of 282 prediction parquets. Schema captures git_sha + config_hash + contamination_flag (ADR-005 three-state taxonomy) + rung + fold + seed + slice + n_rows + predictions_relpath + generated_at_utc. Provenance lives in sibling JSON, not in the parquet columns themselves — minimizes risk of corrupting the 282-file artifact set + decouples provenance bump from parquet regen. Per-prediction shape (vs single rolled-up) chosen per /exploring-options batch 11 Q1 lock (matches upstream manifest.v3 fine-grained convention; reviewer can audit any single cell). Three filename patterns supported: trained-with-tail (transformer per-slice + per-epoch), trained-no-tail (classical floor LODO cells), reference (ungridded protectai-v1/v2). Idempotent: re-running on same git SHA + same configs produces byte-identical manifest content (modulo generated_at_utc; that field is documented as the regen-stamp).

Acceptance criterion

At v1.0.8 close, scripts/backfill_provenance.py exists + make backfill-provenance Makefile target invokes it. Running python scripts/backfill_provenance.py emits 282 manifest.json files under evals/manifests/. Running --check mode reports OK: all 282 manifests present + exits 0. Each manifest carries the 9 required fields (schema_version + adr_ref + generated_at_utc + git_sha + config_hash + contamination_flag + rung + n_rows + predictions_relpath) + optional fold + seed + slice_name + epoch per the parquet’s filename pattern. Contamination flags map cleanly to the 3-state taxonomy (vendor_black_box tier carries 0 entries per ADR-050 R1). decisions/library_imports.md notes the new script; CHANGELOG [1.0.8] carries the §1.9 closure narrative.
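
A rough sketch of what a --check pass over the manifest directory could look like, assuming the required field names listed in the criterion above. The function names and the exact messages are illustrative only; the real script's output format is not reproduced here.

```python
import json
from pathlib import Path

# Required field names as listed in the acceptance criterion.
REQUIRED_FIELDS = (
    "schema_version", "adr_ref", "generated_at_utc", "git_sha", "config_hash",
    "contamination_flag", "rung", "n_rows", "predictions_relpath",
)


def missing_fields(manifest):
    # Names of required fields absent from one parsed manifest.
    return [name for name in REQUIRED_FIELDS if name not in manifest]


def check_manifests(manifest_dir, expected_count):
    """Return 0 when every manifest is present and complete, else 1."""
    paths = sorted(Path(manifest_dir).glob("*.json"))
    problems = []
    if len(paths) != expected_count:
        problems.append(f"expected {expected_count} manifests, found {len(paths)}")
    for path in paths:
        missing = missing_fields(json.loads(path.read_text(encoding="utf-8")))
        if missing:
            problems.append(f"{path.name}: missing {missing}")
    for line in problems:
        print(line)
    if not problems:
        print(f"OK: all {expected_count} manifests present")
    return 0 if not problems else 1
```

Called as check_manifests("evals/manifests", 282), the return value would double as the process exit code.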

CLAIM-058 - ADR-058: scripts/eval_from_hub.py non-dry-run body wired — narrow supersession of ADR-051 Block A (T0 score-match carryforward); Block B (38 invariant scaffolds) remains carryforward

Status: Accepted

Source: transcripts/2026-05-19__v1-0-9-adr-051-block-a-close.md (private; emailed at submission)

Closing commit/ADR: v1.0.9

Claim

ADR-051 narrowly superseded ADR-034 §Tier T0 + ADR-039 §gate 3 via two carryforward blocks with explicit v1.1.x landing conditions. v1.0.9 closes Block A — the T0 score-match wiring at scripts/eval_from_hub.py — by replacing the stub that exited 2 with the full body: huggingface_hub.snapshot_download(repo_id) → architecture-dispatched load (AutoModelForSequenceClassification.from_pretrained for frozen-probe + full-ft; base ModernBERT + PeftModel.from_pretrained for lora) → CPU inference via library-first reuse of src.training.train_modernbert._predict_proba → per-row score-match against the committed reference predictions parquet at evals/predictions/<rung>__fold0__seed42__<slice>.parquet within 1e-4 absolute tolerance per ADR-034 §Tier T0. Strict-mode exit code per /exploring-options 2026-05-19 Q1 lock: exit 1 on any row exceeding tolerance (no silent failures); exit 0 on all-pass; exit 2 on invalid args (rung not in evals/results.json::published_rungs, reference parquet missing). Smoke coverage at tests/smoke/test_eval_from_hub_smoke.py (9 tests; mocked-only per Q6 lock — no real HF Hub fetch in CI). Block B (38 invariant scaffolds in tests/test_invariants.py) remains carryforward to v1.1.x per ADR-051 — Block B is orthogonal to T0 wiring and out of scope for v1.0.9. The narrow supersession scope of this ADR is Block A only.
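
The strict-mode exit-code contract above can be sketched in a few lines. This is an illustration under stated assumptions, not the body of scripts/eval_from_hub.py: scores are plain Python floats rather than parquet columns, and score_match and exit_code are hypothetical names.

```python
def score_match(candidate, reference, tol=1e-4):
    """Return the indices of rows whose absolute score difference exceeds tol."""
    if len(candidate) != len(reference):
        raise ValueError("row counts differ between candidate and reference")
    return [i for i, (c, r) in enumerate(zip(candidate, reference)) if abs(c - r) > tol]


def exit_code(rung, published_rungs, reference, candidate, tol=1e-4):
    # Exit 2: invalid args (rung not published, or reference predictions missing).
    if rung not in published_rungs or reference is None:
        return 2
    # Exit 1: at least one row exceeds tolerance. Exit 0: every row passes.
    return 1 if score_match(candidate, reference, tol) else 0
```

Returning the list of failing row indices, rather than a bare boolean, keeps the "no silent failures" intent: a caller can report exactly which rows drifted.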

Acceptance criterion

At v1.0.9 close, scripts/eval_from_hub.py non-dry-run path executes the full wiring (snapshot_download → load → CPU inference → per-row score-match → emit predictions parquet) instead of returning 2 with a carryforward message. make eval-from-hub RUNG=frozen-probe and RUNG=lora exit 0 with score-match summary within 1e-4 tolerance per ADR-034 §Tier T0 §Score-match contract. uv run pytest tests/smoke/test_eval_from_hub_smoke.py -v reports 9/9 passing. uv run mypy --strict scripts/eval_from_hub.py returns 0. ADR-051 frontmatter superseded_by updated to ["058"] in-place per ADR-029 convention (narrow Block A only; Block B remains carryforward).

CLAIM-059 - ADR-059: runpod-deploy installs from PyPI (not git+https) at v0.8.4+ — narrow supersession of ADR-036 “git URL is the only viable spec format” sub-claim for runpod-deploy

Status: Accepted

Source: transcripts/2026-05-19__v1-1-0-runpod-deploy-and-deberta.md (private; emailed at submission)

Closing commit/ADR: v1.1.0

Claim

ADR-036 §Context claims “All three are own-authored on GitHub (not on PyPI) — git URL is the only viable spec format.” This premise was already false for eval-toolkit (closed by ADR-055 at v1.0.8). It is now also false for runpod-deploy as of v0.7.1 (Trusted Publishing OIDC live on PyPI; v0.7.7 through v0.8.4 all published as wheels). v1.1.0 switches the runpod-deploy pin from runpod-deploy @ git+https://github.com/brandon-behring/runpod-deploy@v0.7.7 to the PyPI install spec runpod-deploy==0.8.4 (PEP 508 version specifier; mirrors the ADR-055 pattern for eval-toolkit). uv.lock continues to provide byte-level reproducibility via SHA-256 of the PyPI wheel artifact. research_toolkit retains git+https tag-pin per ADR-036 (not yet on PyPI; will get its own narrow ADR when published). The remainder of ADR-036 — tag-pin convention, freeze-for-submission-window discipline, 3 bump triggers (blocking-bug + critical-security + reviewer-feedback-patch), uv.lock byte-level backstop — is unchanged. v0.8.x also brings a load-bearing schema break: legacy stop: {on_success, on_failure} was REMOVED in v0.8.3; the 3 affected headline-*.yaml configs were migrated to lifecycle: schema in Commit 1 of 3 BEFORE this pin bump in Commit 2 of 3. Three /exploring-options 2026-05-19 execution-level locks govern this commit: Q3 (narrow scope; mirror ADR-055), Q4 (auto-continue-on-green-CI; v1.0.9 CI green at 4m42s 2026-05-19T02:03Z triggered Phase B start), Q5 (3 logical sub-commits as load-bearing audit trail).

Acceptance criterion

At v1.1.0 close, pyproject.toml specifies runpod-deploy==0.8.4 (PyPI install) not runpod-deploy @ git+https://...@v0.7.7. uv pip show runpod-deploy reports Version: 0.8.4 + Location: .venv/...site-packages/ (not git-clone source). uv.lock entry shows registry = "https://pypi.org/simple" not git = "https://github.com/...". make headline-dry-run (validate --all on each of the 3 migrated configs) exits 0 with [image-registry] ok per #97 consumption. The runpod_deploy_long_ssh.py shim is DELETED in the same commit per no-orphaned-code invariant; budget.ssh_ready_timeout_sec: 600 override appears in all 3 configs per #88 consumption. ADR-036 frontmatter superseded_by updated to ["055", "059"] in-place per ADR-029 convention.

CLAIM-060 - ADR-060: DeBERTa-v3-base medium ablation methodology — chunk-and-average vs head-truncation × 5-slice OOD; infrastructure-only at v1.1.0; execution deferred to v1.1.1

Status: Accepted

Source: transcripts/2026-05-19__v1-1-0-runpod-deploy-and-deberta.md (private; emailed at submission)

Closing commit/ADR: v1.1.0 (methodology); v1.1.1 (execution carryforward)

Claim

NEXT_STEPS §1.10 (DeBERTa-v3-base long-context ablation) was explicitly tagged v1.1.x at submission. /exploring-options 2026-05-19 batch 9 Q1 locked the medium ablation scope: train DeBERTa-v3-base once (1 fold + 1 seed; LODO not full grid); apply 2 truncation strategies (chunk-and-average over 512-token windows + head-truncation); evaluate on the full 5-slice OOD slate (BIPIA + InjecAgent + JBB-Behaviors + XSTest + NotInject); ablation-appendix framing in RESULTS §1B (NOT integrated as 6th rung in the headline ladder, per §1.10 literal). Approximate compute envelope: 1×L4 or 1×A100; ~30 min wall per training fire; ~$8-10 GPU. /exploring-options 2026-05-19 Q2 lock chose sequential single-pod 2-fire via lifecycle.on_success: recycle (#90 closure consumption; saves ~$1-2 + ~3-5 min per fire vs full teardown between truncation strategies). Scope-mismatch discovery (2026-05-19 mid-execution): the existing training pipeline (src/training/train_modernbert.py + src/training/load_modernbert.py) is ModernBERT-specific by construction; the MODERNBERT_BASE_HF_ID constant is hard-coded in the loader. Adding DeBERTa requires loader-refactor + trainer-extension + chunk-and-average windowed-inference code + eval-pipeline integration — ~4-6h infrastructure work BEFORE any GPU fire. The v1.1.0 plan estimated B5 at ~2.5h on the assumption the trainer was generic; it isn’t. Per /exploring-options 2026-05-19 Path B lock, v1.1.0 lands the methodology + infrastructure scaffold (this ADR + configs/rungs/deberta_v3_base.yaml + configs/runpod/headline-deberta.yaml + Makefile target stubs + RESULTS §1B placeholder); execution defers to v1.1.1. The methodology lock is binding regardless of when execution lands.
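
The two truncation strategies named in the methodology lock differ only in which token windows reach the model. A minimal sketch, assuming non-overlapping 512-token windows and a caller-supplied score_window function standing in for the model; whether the project uses a stride or overlap is not stated here, and none of these names are taken from src/inference/windowed.py.

```python
def head_truncate(token_ids, max_len=512):
    # Head-truncation: keep only the first max_len tokens, as a single window.
    return [token_ids[:max_len]]


def chunk_windows(token_ids, max_len=512):
    # Chunking: split the whole sequence into consecutive max_len-token windows.
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]


def score_document(token_ids, score_window, strategy, max_len=512):
    """Score one document under the chosen strategy; score_window maps a window to a float."""
    if not token_ids:
        raise ValueError("empty document")
    if strategy == "chunk_and_average":
        windows = chunk_windows(token_ids, max_len)
    elif strategy == "head_truncation":
        windows = head_truncate(token_ids, max_len)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    scores = [score_window(window) for window in windows]
    # With a single window the average is just that window's score.
    return sum(scores) / len(scores)
```

For documents of at most 512 tokens the two strategies see identical input, so any difference between them can only come from longer documents.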

Acceptance criterion

At v1.1.0 close, this ADR exists with status “Accepted (methodology lock — infrastructure landed; execution deferred to v1.1.1)”. configs/rungs/deberta_v3_base.yaml exists with hyperparameter recipe (backbone, fold/seed scope, truncation strategy switch, batch sizes, epoch count). configs/runpod/headline-deberta.yaml exists with lifecycle.on_success: recycle + budget.ssh_ready_timeout_sec: 600 + image pin + cost cap per ADR-020 soft $125 per-job. Makefile targets train-deberta-v3, eval-deberta-v3, deberta-ablation exist as stubs that exit 2 with a “v1.1.1 execution carryforward” message + pointer to this ADR. RESULTS.md gains a §1B placeholder section with the methodology lock + planned scope + “Results pending v1.1.1 GPU fire” line. WRITEUP/limitations-and-future-work.md §9.2 documents the deferred execution. NEXT_STEPS.md §1.10 status → “methodology landed at v1.1.0 (ADR-060); execution v1.1.1”. decisions/upstream_issues.md references ADR-060 in the #90 row (the deferred consumer-site for lifecycle.on_success: recycle). SUBMISSION_AUDIT.md regenerates clean with 60 CLAIM rows. v1.1.1 landing condition: make deberta-ablation exits 0 with per-truncation AUPRC/AUROC entries in evals/metrics/per_cell_deberta.parquet; RESULTS §1B placeholder replaced with real numbers; ~$5-7 GPU spend recorded in evals/cost_ledger.csv (within ADR-020 envelope).

CLAIM-061 - ADR-061: Quarto site navigation restructure — landing-page rebuild + navbar consolidation + sidebar hub-spoke nesting (narrow supersession of ADR-053 navigation contract)

Status: Accepted

Source: transcripts/2026-05-19__v1-1-1-quarto-clarity-restructure.md (private; emailed at submission)

Closing commit/ADR: v1.1.1

Claim

User feedback 2026-05-19 surfaced a discoverability problem on the live Quarto site: “the quatro documents they seem really confusing and hard to follow, the whole points was them to be a cleaner version. … it isn’t immdiately clear to me where to find the results and explanations in clear language about wha they mean.” Explore-agent audit confirmed the root cause: although the headline results table is on the index.qmd landing page (lines 18-32) and the 5 interpretation patterns are there (lines 36-50), the navbar’s 9-item top-level fragmentation (Executive summary / Reading guide / Results / Methodology (TOC) / Spokes / Notebooks / Evidence / Reference / Decisions) buries them — and “Methodology (TOC)” + “Spokes” appear as peer items when actually the spokes are CHILDREN of the hub. The user perceived a “GitHub-blob-view of WRITEUP.md has more than the Quarto site” inversion (Quarto is actually the SUPERSET — hub + 8 spokes ~1,449 lines vs WRITEUP.md alone 292 lines), but the underlying UX problem is real regardless. v1.1.1 restructures the navigation: navbar 9 → 5 items (Results / Methodology dropdown / Decisions dropdown / Reference dropdown / Repo); sidebar nests 8 spokes under WRITEUP.md via Quarto sub-sections; index.qmd rebuilt to ~30 lines (results + 5-bullet plain-language meaning + 3 obvious drill-down links) with the displaced reading-guide content moved to a new READING_GUIDE.md page; hub-spoke signposting added (2-paragraph primer at top of WRITEUP.md + 1-line back-link at top of each of 8 spokes); README “How to read” clarified. No methodology content rewritten — pure navigation + signposting changes. ADR-053 reading-guide governance dimensions 2 (3-reading-paths) + 3 (headline-finding-block) + 4 (interpretation pedagogy) + 5 (pointer convention) all preserved; only dimension 1 (navbar/sidebar architecture) is narrowly superseded.

Acceptance criterion

At v1.1.1 close: live Quarto site landing page shows results + plain-language interpretation above the fold (3 obvious links to drill deeper). Navbar has 5 top-level items (not 9). Sidebar shows “Methodology > Detailed spokes (8 topics) > …” nesting. Each of 8 WRITEUP/*.md spoke files has a 1-line back-link to WRITEUP.md at the top. WRITEUP.md has a 2-paragraph hub-spoke primer immediately after the title. index.qmd is ~30 lines (down from 137); the displaced reading-guide content lives at the new READING_GUIDE.md. quarto render builds clean. CI green on the v1.1.1 commit. ADR-053 frontmatter shows superseded_by: ["061"] in-place per ADR-029 convention. SUBMISSION_AUDIT.md regenerates with 61 CLAIM rows. Reviewer URLs all 200 (tree/v1.0.0 unchanged per ADR-033; releases/v1.1.1 newly resolvable; live Quarto site reflects v1.1.1).

CLAIM-062 - ADR-062: Quarto writeup clarity rewrite and canonical reviewer figure slate

Status: Accepted

Source: user request 2026-05-19; implementation plan “Quarto Writeup Clarity Rewrite”

Closing commit/ADR: pending implementation commit

Claim

User feedback after the v1.1.1 navigation pass found that the Quarto/writeup was still too jargon-heavy and dense for a first-time hiring-manager reader: the problem setup, metric meaning, plot interpretation, and limits were not clear enough. This patch rewrites the first reading path problem-first, pushes ADR/process detail below the core story, and replaces the scaffolded seven-figure slate with five reviewer-facing figures rendered only from canonical eval artifacts. Synthetic scaffold plots remain available only for smoke tests outside docs/plots.

Acceptance criterion

index.qmd, EXECUTIVE_SUMMARY.md, WRITEUP.md, RESULTS.md, README, READING_GUIDE, and GLOSSARY explain the problem, metrics, headline result, plots, and limitations in plain language before deep methodology detail. scripts/render_figures.py --out-dir docs/plots renders exactly F1-F5 from committed canonical artifacts under evals/; each sidecar records data_mode: canonical, ADR-062, commit SHA, generation time, and source artifact paths. --scaffold refuses to write to docs/plots and is test-only. F6/F7 scaffold figures are removed from the reviewer-facing path. Quarto site renders cleanly and smoke tests pass.

CLAIM-063 - ADR-063: DeBERTa-v3-base medium ablation execution (v1.1.2 carryforward; narrow renaming-only supersession of ADR-060 “v1.1.1” body references) — null result, backbone-dominant interpretation

Status: Accepted

Source: transcripts/2026-05-19__v1-1-2-deberta-execution.md (private; emailed at submission)

Closing commit/ADR: v1.1.2

Claim

ADR-060 (2026-05-19) landed the DeBERTa-v3-base medium ablation methodology lock at v1.1.0 with body text “execution deferred to v1.1.1”. That naming did not survive the same calendar day: the v1.1.1 slot was consumed by ADR-061 (Quarto site navigation restructure; user-feedback discoverability fix) before the DeBERTa execution could begin. DeBERTa execution carried forward to v1.1.2 per the v1.1.2-handoff doc, with the ADR-060 body text left immutable per the project’s no-amend / no-squash ADR-discipline. This ADR codifies the carryforward as a narrow renaming-only supersession of ADR-060’s “v1.1.1” body references; the methodology lock itself (single fold/seed, 2 truncation strategies, 5-slice OOD eval, ablation-appendix framing, ~$5-7 GPU envelope) is unchanged. v1.1.2 execution shipped both training fires on a single warm A100-SXM4-80GB pod via lifecycle.on_success: recycle (chunk_and_average -> recycle -> head_truncation -> explicit runpod-deploy stop). The 2-strategy ablation produced essentially identical per-slice metrics (pooled OOD AUPRC 0.2912 vs 0.2895; jbb_behaviors AUPRC 0.4855 vs 0.4890; xstest AUPRC 0.3966 vs 0.3912), a publishable null result. By the ADR-060 confound-control interpretation, this indicates the ModernBERT advantage on the headline ladder is BACKBONE-DOMINANT, not context-window-dominant. Actual GPU spend: $1.34 (well under the $5-7 expected envelope; well under the ADR-020 $125 per-job soft cap and $200 cumulative hard cap).
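
To make "essentially identical" concrete, the absolute gaps between the quoted AUPRC pairs can be computed directly. The pairs are kept in the order they appear in the claim; the claim does not label which value belongs to which strategy, so the tuple positions below are simply "first quoted, second quoted".

```python
# AUPRC pairs as quoted in the claim, in the order they appear there.
auprc_pairs = {
    "pooled_ood": (0.2912, 0.2895),
    "jbb_behaviors": (0.4855, 0.4890),
    "xstest": (0.3966, 0.3912),
}


def absolute_gaps(pairs):
    # Absolute difference per slice, rounded to the 4 decimals used in the source.
    return {name: round(abs(a - b), 4) for name, (a, b) in pairs.items()}


gaps = absolute_gaps(auprc_pairs)
largest_gap = max(gaps.values())
print(gaps, largest_gap)
```

The gaps are 0.0017, 0.0035, and 0.0054, and the sign of the difference is not consistent across slices. Whether gaps of that size fall within noise depends on confidence intervals that are not quoted in this ledger.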

Acceptance criterion

At v1.1.2 close: evals/metrics/per_cell_deberta.parquet exists with at least the 6 binary-class slice rows (2 strategies x {jbb_behaviors, xstest, pooled_ood}; single-class slices bipia + injecagent + notinject + iid correctly skipped per ADR-006 single-class-slice handling). evals/cost_ledger.csv carries the 9 pid-deberta-2026051* rows totaling $1.34. NEXT_STEPS.md §1.10 has a Status (v1.1.2) paragraph capturing the headline + null-result interpretation. CHANGELOG.md [1.1.2] block exists with the headline-result table. src/inference/windowed.py + src/training/load_backbone.py + scripts/run_deberta_ood_inference.py all committed. configs/rungs/deberta_v3_base.yaml carries the pinned revision: 8ccc9b6f36199bec6961081d44eb72fb3f7353f3 + bf16: false (DeBERTa numerical-stability fix). RESULTS §1B + WRITEUP §9.2 + library_imports.md inventory updates defer to a follow-up patch after the in-flight ADR-062 doc-rewrite commit stabilises (audit-clean separation per project no-conflated-scopes discipline). SUBMISSION_AUDIT.md regenerates with at least 63 CLAIM rows (ADRs 001 through 063).

CLAIM-064 - ADR-064: Writeup hiring-manager clarity polish + documentation-wide consistency conventions — additive layer on top of ADR-062

Status: Accepted

Source: transcripts/2026-05-19__v1-2-0-writeup-clarity-and-consistency.md (private; emailed at submission)

Closing commit/ADR: v1.2.0

Claim

Post-v1.1.3 user review (ADR-062 baseline) surfaced that the Quarto writeup needed a hiring-manager clarity pass: jargon density, plot-interpretation cues, and table-context framing were under-served. User expansion: ensure consistency across README and other documentation surfaces. [Verbatim wording redacted per ADR-074.] Three /exploring-options rounds locked: heavy/fresh pass (not light/medium); commit doc-agent ADR-062 work as v1.1.3 baseline; DeBERTa null-result lives as RESULTS §1B callout (NOT a new F6 figure); hiring-manager landing = new standalone page; spoke density = light signpost + Result-bolding; figure refinements = prose + SVG axis-label fixes; sub-commits = logical (6 commits); full-repo audit + 3-stage release shape (v1.1.3 baseline + v1.1.4 consistency-only + v1.2.0 heavy pass); ADR-063 fact-correction via CHANGELOG postscript + this ADR §D narrative flag (lightest ceremony respecting the project ADR-discipline per CLAUDE.md “ADRs are immutable; supersede via new ADR”); markdown-link-checker pre-commit prophylaxis (CI only; no local pre-commit hook to avoid contributor friction); no companion technical-reviewer landing page (READING_GUIDE.md already serves that role). The 3-audit findings + locked decisions produce a ~5.5-hour 6-commit heavy pass on top of ADR-062’s structural rewrite. The pass adds reviewer-facing clarity polish (jargon glossing invariant + spoke skim signposts + table/plot context + DeBERTa §1B + figure caption + SVG axis-label refinements + hiring-manager landing) AND documentation-wide consistency invariants (canonical terminology table + project-ADR-discipline-immutable broken-slug-ref + cumulative-cost-figure flag for 5 prior ADRs).

Acceptance criterion

At v1.2.0 close: docs/for-hiring-managers.md exists (~250 words; 4-question format: problem / found / trust / candidate-thinking) and is reachable from the navbar Reference dropdown + the index.qmd “Read Next” section. The READING_GUIDE.md technical-reviewer path is named #technical-reviewer-path for anchor linkage from the hiring-manager page. RESULTS.md §1B between §1 Primary Table and §2 Paired Comparison carries the DeBERTa ablation per-strategy table + “what §1B says / does not say” pair + backbone-dominant interpretation; the WRITEUP/limitations-and-future-work.md:161 link to RESULTS §1B resolves cleanly. The 5 reviewer-facing figures (F1-F5) have self-documenting axis labels + sublegends (regenerated via make render-figures); RESULTS.md figure captions match the audit gap fixes (F1 random-floor explanation; F2 axis + CIs-crossing-zero cue; F3 N/A explanation; F4 subpanel mapping; F5 ECE/Brier gloss). All 8 WRITEUP/*.md spokes carry a “How to read this spoke” blockquote + **Result** bold prefixes on existing summary sentences. docs/GLOSSARY.md carries the canonical-terminology table from §C below plus 3+ new entries (confound, ablation, detector clarifier note). decisions/library_imports.md carries the deferred-from-v1.1.2-Phase-B entry for src/inference/windowed.py. NEXT_STEPS.md §1.10 status reflects v1.2.0 landing. CHANGELOG.md [1.2.0] entry exists with the 6-commit cadence narrative + ADR-064 cross-ref + corrected ADR slug refs. SUBMISSION_AUDIT.md regenerates clean with 64 CLAIM rows. CI markdown-link-checker (lychee; added at v1.1.4) passes on the v1.2.0 push.

CLAIM-065 - ADR-065: Writeup accuracy-audit methodology + Quarto callout-note convention + narrative invariants — additive layer on top of ADR-064

Status: Accepted

Source: transcripts/2026-05-19__v1-2-1-narrative-clarity-accuracy.md (private; emailed at submission)

Closing commit/ADR: v1.2.1

Claim

User feedback 2026-05-19 (post-v1.2.0 ADR-064 close): one more polish pass focused on three properties — narrative consistency (does the writeup hang together as a coherent story?), clarity (visual boxes for skim summaries; density reduction in dense spokes), and accuracy (every cited number re-verified against canonical sources). Three parallel /exploring-options rounds locked the v1.2.1 scope (rounds 3 + 5 + 6; 12 total decisions across audit-script architecture, voice + tense pass mechanics, callout-note content source, sub-commit granularity, cumulative-cost figure precision). The pass adds a programmatic numeric-claim audit (scripts/audit_writeup_numbers.py; configurable --strict / --report-only flag; CI hard-gate via .github/workflows/audit-writeup.yml), a Quarto callout-note convention for spoke Summary boxes and collapsible hyperparameter detail, narrative invariants (third-person voice; tense discipline; paragraph length cap; transition sentences), and the canonical cumulative-cost figure ($17.08, full precision $17.0807) computed from evals/cost_ledger.csv at v1.2.0 close — superseding ADR-063’s stale $9.92 figure (flagged but not computed in ADR-064 §D). Result: a closing-polish layer on v1.2.0 that ratifies reviewer-readiness via programmatic drift defense + skim-path discipline + voice consistency.
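
The relationship between the full-precision and displayed cumulative-cost figures can be sketched as a ledger sum followed by rounding to cents. This is an assumption-laden illustration: the column name cost_usd is hypothetical (the real evals/cost_ledger.csv schema is not shown in this ledger), and the two rows in the usage note are made-up values chosen only so that they sum to the documented total.

```python
import csv
import io
from decimal import Decimal, ROUND_HALF_UP


def cumulative_cost(csv_text, column="cost_usd"):
    # Sum one column exactly with Decimal to avoid binary floating-point drift.
    reader = csv.DictReader(io.StringIO(csv_text))
    return sum((Decimal(row[column]) for row in reader), Decimal("0"))


def display_cost(total):
    # Round the full-precision total to cents for reviewer-facing prose.
    return f"${total.quantize(Decimal('0.01'), rounding=ROUND_HALF_UP)}"
```

With an illustrative ledger of "pid,cost_usd\na,10.0000\nb,7.0807\n", cumulative_cost returns Decimal("17.0807") and display_cost renders it as $17.08, matching the two precisions quoted above.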

Acceptance criterion

At v1.2.1 close: decisions/ADR-065-writeup-accuracy-narrative-and-callout-conventions.md exists (this file). scripts/audit_writeup_numbers.py exists (configurable --strict default + --report-only opt-out flag; scans 4 categories: numbers / ADR slugs / version strings / URLs; ~200 LOC) and returns 0 drifts on the reviewer-facing markdown surface. .github/workflows/audit-writeup.yml exists (runs default-strict on push to main + PR + weekly schedule, mirroring the lychee CI pattern from ADR-064 §C2). docs/GLOSSARY.md carries the canonical-callout convention section (when to use :::{.callout-note} Summary vs :::{.callout-tip collapse="true"} hyperparameter detail vs :::{.callout-warning} caveats). All 8 WRITEUP/*.md spokes carry a top-of-page :::{.callout-note} Summary box (3-5 bullets distilled from existing **Result**-bolded subsection sentences). WRITEUP/model-rungs.md carries :::{.callout-tip collapse="true"} collapsible hyperparameter blocks (LoRA / TF-IDF / ProtectAI specs). All reviewer-facing markdown surfaces consistently use third-person voice (no first-person “we” / “I” / “our”) and past-tense for completed methodology actions / present-tense for invariants. decisions/library_imports.md carries the audit-script entry with the audit-tooling-not-primitive tag per the strengthened library-first invariant. CHANGELOG.md [1.2.1] entry exists with the 7-commit cadence narrative + ADR-065 cross-ref + the cumulative-cost figure propagation chain. NEXT_STEPS.md §1.10 carries a Status (v1.2.1) paragraph + a 1-line cumulative-cost-to-date footnote citing ADR-065 §E. SUBMISSION_AUDIT.md regenerates clean with 65 CLAIM rows. CI markdown-link-checker (lychee) + new audit-writeup workflow both pass on the v1.2.1 push.

CLAIM-066 - ADR-066: Library-first carryforward refactor v1.2.2 — consume 7 closed eval-toolkit primitives across 6 sites

Status: Accepted

Source: transcripts/2026-05-19__v1-2-2-library-first-refactor-and-immutability-clarification.md (private; emailed at submission)

Closing commit/ADR: v1.2.2

Claim

Post-v1.2.1 audit of decisions/upstream_issues.md surfaced that ALL 7 eval-toolkit upstream issues previously documented as “filed; awaiting upstream” (#14 plot_roc_curve, #15 plot_pareto_frontier, #16 plot_slice_metric_heatmap, #17 n_jobs for paired_bootstrap_diff, #20 generalized mde_from_ci, #21 block_bootstrap_on_folds, #22 ax= kwarg on plot_metric_bars) have CLOSED upstream and are exposed in the currently-pinned eval-toolkit==0.42.0 (verified via Python attribute check). 6 local workaround sites in src/eval/ (figures.py::render_f1_pareto + render_f2_roc_per_rung + render_f5_slice_heatmap + render_f6_lodo_breakdown; mde.py::mde_from_marginal_ci_record; cross_fold_ci.py::compute_cross_fold_ci_cell) carry TODO comments citing these now-closed issues. Per the strengthened library-first invariant (memory library_first_is_project_wide_invariant 2026-05-18: “NO local workarounds whatsoever. Primitives belong in eval-toolkit / runpod-deploy / research_toolkit as PyPI deps; missing → upstream MR BLOCKS dependent work”), this is maintenance debt to pay down. v1.2.2 performs the carryforward refactor: each of the 6 sites consumes the closed upstream primitive; local hand-rolls + unused imports are deleted in the SAME commit per the no-orphaned-code-during-refactor invariant. F1-F6 figures are re-rendered via make render-figures with visual-parity verification per the spirit-of-original threshold (ADR-locked caption discipline preserved; cosmetic upstream improvements welcomed). No methodology change; refactor only. Pattern after ADR-047 / ADR-056 / ADR-058.

Acceptance criterion

At v1.2.2 close: 6 local workaround sites in src/eval/ consume upstream primitives (et.plot_roc_curve at F2; et.plot_pareto_frontier at F1; et.plot_slice_metric_heatmap at F5; et.plot_metric_bars(ax=...) at F6 left panel; generalized et.mde_from_ci at mde_from_marginal_ci_record; et.block_bootstrap_on_folds at compute_cross_fold_ci_cell). Corresponding issue #(14|15|16|17|20|21|22) TODO comments removed; verified by grep -nE "issue #(14|15|16|17|20|21|22)" src/eval/ → 0 hits. decisions/upstream_issues.md shows status updated to “RESOLVED in eval-toolkit v0.42.0; consumed at v1.2.2” for issues #14, #15, #16, #17, #20, #21, #22. F1-F6 figures re-rendered cleanly via make render-figures; ADR-locked caption discipline preserved per ADR-062 + ADR-064 §B4. Project glue (render_* orchestrator wrappers; MDECellModel schema; CrossFoldCICellModel schema; F5 N/A overlays; F6 reachability asterisks) stays; only inline numerical + plotting impls get replaced. eval-toolkit pin unchanged at v0.42.0 (already exposes all 7 primitives, so no pin bump needed). make smoke + make site + lychee + audit-writeup + regenerate-audit-check all green on the v1.2.2 push.
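
The "0 hits" check on stale TODO references can also be expressed in Python, which makes it easy to run inside a test. A minimal sketch with a hypothetical function name; note one deliberate deviation from the grep pattern quoted above: a trailing word boundary is added so that, for example, "issue #140" is not counted as a reference to issue #14.

```python
import re
from pathlib import Path

# Same alternation as the grep pattern, plus \b so longer issue numbers do not match.
CLOSED_ISSUE_RE = re.compile(r"issue #(14|15|16|17|20|21|22)\b")


def find_stale_issue_refs(root):
    """Return (path, line number, line) for every reference to a closed upstream issue."""
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            if CLOSED_ISSUE_RE.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits
```

An empty list from find_stale_issue_refs("src/eval") corresponds to the "0 hits" condition.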

CLAIM-067 - ADR-067: Narrow immutability relaxation for factual-typo fixes in cross-reference slug filenames

Status: Accepted

Source: transcripts/2026-05-19__v1-2-2-library-first-refactor-and-immutability-clarification.md (private; emailed at submission)

Closing commit/ADR: v1.2.2

Claim

The ADR-immutability rule per CLAUDE.md (“ADRs are immutable; supersede via new ADR”) was authored to protect DECISION content (methodology choices, locked numeric values, alternatives considered, prose rationale) from post-hoc revision — the audit-trail discipline that lets reviewers reconstruct what-was-thought-at-each-tag. Applying that same rule to factual-typo defects in cross-reference slug filenames (wrong slug pointing at a wrong-but-existing file in decisions/) is over-strict: fixing such defects does NOT alter decision content + does NOT break the audit trail. The current state — 14 broken slug-reference patterns documented in .lycheeignore + ADR-064 §D as “flagged-not-fixed” — imposes reader friction (404s on in-ADR cross-refs) without protecting anything load-bearing. v1.2.2 ADR-067 codifies a narrow exception: factual typos in cross-reference slug filenames MAY be corrected in-place in the affected ADR file, with the commit message citing ADR-067 + listing the per-file corrections. ALL other content (numeric values, methodology decisions, prose, alternatives considered, table data, frontmatter fields other than slug refs) remains immutable per the existing rule. v1.2.2 Commit 2 applies the relaxation to fix the 14 documented broken refs (12 with canonical-correct slug substitutes; 2 ADR-029 misattributions handled by removing the broken ref + citing CLAUDE.md directly). The 14 corresponding .lycheeignore patterns are deleted in the same commit (no longer ignored — actually working).

Acceptance criterion

At v1.2.2 close: decisions/ADR-067-immutability-clarification-and-canonical-slug-reference.md exists (this file). CLAUDE.md §“Phase 0 workflow” immutability-rule block updated with the narrow-relaxation clarification. decisions/README.md §Lifecycle updated with the same narrow-relaxation language. docs/GLOSSARY.md carries an entry for “Immutability relaxation (factual-typo class)” cross-referencing this ADR. All 14 broken slug-reference patterns flagged in .lycheeignore + ADR-064 §D resolve to canonical-correct slugs in the affected immutable ADRs (ADR-046, ADR-048, ADR-059, ADR-060, ADR-063, and any others discovered during grep -rn audit). The 14 corresponding patterns in .lycheeignore are deleted. The 2 ADR-029 misattributions (referencing non-existent “immutability ADR-029” slugs) are corrected by removing the broken ref + citing CLAUDE.md directly. Lychee CI passes on the v1.2.2 push (the previously-ignored refs now resolve cleanly without needing ignore patterns). Future PR reviews flag any in-place ADR edit that goes beyond slug-filename correction (numeric values + methodology + prose remain immutable per the existing rule).

CLAIM-068 - ADR-068: Narrow immutability relaxation for broken external references in immutable ADRs (local-filesystem paths + aspirational upstream refs)

Status: Accepted

Source: v1.2.6 link-check fix-forward (root-cause analysis began during the v1.2.5 /exploring-options 4-question walk on 2026-05-19) — user picked all 4 recommended options including “Extend ADR-067 narrow relaxation → ADR-068 + in-place fix” + “Full structural fix — ADR-068 + DOI canon + triage”

Closing commit/ADR: v1.2.6

Claim

ADR-067 established a narrow exception to the CLAUDE.md ADR-immutability rule for factual typos in cross-reference slug filenames (wrong-but-still-existing slug pointing at a wrong file in decisions/). The lychee CI introduced at v1.1.4 was non-functional from inception until v1.2.4 (caught by the v1.2.3 patch + the additional v0.23.0 --base incompat fixed at v1.2.4); the FIRST end-to-end lychee scan at v1.2.4 surfaced TWO additional classes of broken markdown links inside immutable ADRs that are NOT covered by ADR-067’s narrow exception but which exhibit the same audit-rationale (no decision content affected; CI can never resolve them; reader experience strictly improved by in-place fix): (a) local-filesystem path references: markdown links pointing at paths under /home/<author>/... or ../../../.claude/... (author’s own machine’s filesystem leaking into committed text), which cannot resolve on any non-author machine, including CI runners and reviewer browsers; (b) aspirational upstream references: markdown links pointing at upstream resources (eval-toolkit/blob/main/docs/methodology/.md) that do NOT exist upstream (author’s mental model of planned-but-never-created upstream docs). ADR-068 extends the narrow-relaxation rule to cover both classes with the SAME §B2-style out-of-scope-list discipline (numeric values + methodology + prose + alternatives + non-slug frontmatter remain immutable). v1.2.6 applies the rule to fix 2 immutable-ADR markdown links (ADR-065 lines 108 + 122) plus the YAML frontmatter list-item in ADR-025 line 13. Mutable files (SPEC_SHEET.md + SPEC_GREENFIELD.md + WRITEUP/.md + decisions/library_imports.md + CHANGELOG.md) are fixed in a separate commit without ADR-068 coverage (the rule applies ONLY to immutable ADRs).

Acceptance criterion

At v1.2.6 close: decisions/ADR-068-immutability-narrow-relaxation-for-broken-external-references.md exists (this file). CLAUDE.md §“Phase 0 workflow” immutability-rule block updated to cite both ADR-067 + ADR-068 as the two narrow exceptions. decisions/README.md §Lifecycle updated with the same dual-citation. docs/GLOSSARY.md entry for “Immutability relaxation” updated to list both classes. All 2 immutable-ADR markdown links flagged by lychee CI (ADR-065 lines 108 + 122) are replaced with descriptive prose preserving citation intent without retaining the broken link target. ADR-025 line 13 (YAML frontmatter references: list-item pointing at non-existent eval-toolkit/blob/main/docs/methodology/thresholds.md) is replaced with a descriptive marker or the closest existing upstream URL. Lychee CI on the v1.2.6 push reports the affected immutable-ADR errors RESOLVED. Future PR reviews flag any in-place ADR edit that goes beyond the enumerated narrow scope (numeric values + methodology + prose + non-slug frontmatter + decision content remain immutable per the existing rule).
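
A minimal sketch of how class (a) links could be spotted before lychee runs. It is not the project's tooling: the function name, the regex, and the two path heuristics (a /home/ prefix and a /.claude/ segment) are assumptions taken from the examples in the claim. Class (b) is not covered, since it needs a check against the upstream repository.

```python
import re

# Inline markdown links of the form [text](target); reference-style links are ignored.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def local_fs_link_targets(markdown_text):
    """Return link targets that look like paths on the author's own machine."""
    hits = []
    for target in LINK_RE.findall(markdown_text):
        # Heuristics assumed from the claim text: /home/<author>/... and .../.claude/...
        if target.startswith("/home/") or "/.claude/" in target:
            hits.append(target)
    return hits
```

Any target this returns cannot resolve on a CI runner, so it is a candidate for replacement with descriptive prose under the ADR-068 rule.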

CLAIM-069 - ADR-069: Narrow immutability relaxation for academic publisher-URL → DOI canonicalization in immutable ADRs

Status: Accepted

Source: v1.2.6 link-check fix-forward (root-cause analysis began during the v1.2.5 /exploring-options round) — post-execution discovery revealed 26 publisher-URL citations all in immutable ADRs; user-recommended option “Sed-based DOI canonicalization across all .md files” requires an ADR-069 cover for the immutable ADR edits.

Closing commit/ADR: v1.2.6

Claim

ADR-067 (v1.2.2) + ADR-068 (v1.2.5) established TWO narrow exceptions to the CLAUDE.md ADR-immutability rule covering (i) slug typos in cross-references + (ii) broken external references to local-filesystem paths or aspirational upstream resources. The first end-to-end lychee scan at v1.2.4 surfaced a THIRD distinct class inside immutable ADRs: publisher-URL bot-403 academic citations. 26 markdown links across 10 immutable ADRs (ADR-005, ADR-006, ADR-007, ADR-011, ADR-015, ADR-021, ADR-022, ADR-023, ADR-024, ADR-056) cite peer-reviewed methodology papers via publisher landing-page URLs (journals.sagepub.com/doi/<DOI>, tandfonline.com/doi/<DOI>, jstor.org/stable/<ID>, dl.acm.org/doi/<DOI>, ojs.aaai.org/index.php/AAAI/article/view/<ID>, researchgate.net/publication/<ID>). These URLs work for human readers (subscription / institutional / paywall access) but academic publishers actively bot-403 unauthenticated CI scans. The bot-blocking is STRUCTURAL: it will never be fixed upstream and cannot be worked around with auth in CI. The canonical academic identifier is the DOI (Digital Object Identifier), resolved via doi.org/<DOI>. The DOI resolver is the durable citation target; where lychee follows DOI redirects into publisher pages that still bot-403, .lycheeignore carries exact DOI exceptions rather than broad publisher-domain ignores. Each publisher URL embeds the DOI in its path; canonicalization is a mechanical sed transformation that: preserves citation intent EXACTLY; replaces unstable publisher-page URLs with the AUTHORITATIVE academic identifier; narrows bot-403 noise in CI to exact documented DOI exceptions; aligns with CrossRef + academic-citation best practice. ADR-069 codifies this as the THIRD narrow exception to the ADR-immutability rule, parallel to ADR-067 + ADR-068 with the same §B-style out-of-scope-list discipline. v1.2.6 applies the rule to fix 26 publisher-URL citations across the 10 affected immutable ADRs.

Acceptance criterion

At v1.2.6 close: decisions/ADR-069-immutability-narrow-relaxation-for-publisher-url-to-doi-canonicalization.md exists (this file). Publisher-URL citations across the 10 affected immutable ADRs are replaced with doi.org/<DOI> equivalents where valid (DOI embedded in each publisher URL’s path, extracted via deterministic sed pattern, except the one JSTOR stable reference without a valid 10.2307/<ID> DOI remains as its original stable URL with an exact .lycheeignore entry). Lychee CI on the v1.2.6 push reports the publisher-URL bot-403 entries resolved or covered by exact documented DOI/stable-link exceptions. Future PR reviews flag any in-place ADR edit that goes beyond the three enumerated narrow scopes (ADR-067 slug typos + ADR-068 broken external refs + ADR-069 publisher → DOI canon).
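
The canonicalization step can be illustrated with a short Python equivalent of the sed pass. This is a sketch under stated assumptions, not the pattern that was actually run: it handles only the three hosts from the claim whose paths carry the DOI after /doi/, it assumes an optional abs/full/pdf path segment, and it stops the DOI at whitespace or a closing parenthesis so markdown link syntax stays intact. JSTOR, AAAI, and ResearchGate URLs pass through unchanged.

```python
import re

# Hosts named in the claim whose URL path has the shape /doi/<DOI>.
DOI_HOSTS = ("journals.sagepub.com", "tandfonline.com", "dl.acm.org")

# Assumed URL shape: https://[www.]<host>/doi/[abs|full|pdf/]<DOI>
PUBLISHER_RE = re.compile(
    r"https?://(?:www\.)?(?P<host>[^/\s)]+)/doi/"
    r"(?:(?:abs|full|pdf)/)?(?P<doi>10\.\d{4,9}/[^\s)]+)"
)


def canonicalize(text):
    """Rewrite publisher landing-page URLs to https://doi.org/<DOI>."""
    def repl(match):
        if match.group("host") in DOI_HOSTS:
            return "https://doi.org/" + match.group("doi")
        return match.group(0)  # unknown host: leave the URL untouched

    return PUBLISHER_RE.sub(repl, text)
```

Because the rewrite only moves the DOI already present in the URL, the citation target is unchanged. A DOI that itself contains a parenthesis would be truncated by this pattern and would need manual handling.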

CLAIM-070 - ADR-070: Narrow immutability relaxation for render-only Markdown corrections in immutable ADRs

Status: Accepted

Source: v1.2.8 rendered-site hardening audit; Quarto warning cleanup from make site

Closing commit/ADR: v1.2.8

Claim

ADR-067, ADR-068, and ADR-069 established three narrow exceptions to the ADR immutability rule for factual defects that make locked ADRs misleading or unusable without changing their decisions. The v1.2.8 rendered-site audit surfaced a fourth distinct class: Markdown syntax that preserves the intended source prose but is parsed incorrectly by Quarto when ADRs are rendered as HTML. ADR-034 contains an outer markdown fenced example that includes inner bash fences; Markdown requires a longer outer delimiter for nested fences, so Quarto emits a fenced-div warning and can render the page incorrectly. ADR-070 permits only render-only Markdown delimiter corrections inside immutable ADRs when the visible decision text, numeric values, links, alternatives, frontmatter, and methodology are unchanged. v1.2.8 applies this once: ADR-034’s outer example fence changes from triple backticks to quadruple backticks so the existing inner code fences remain literal example text.

Acceptance criterion

At v1.2.8 close: ADR-070 exists; decisions/README.md, CLAUDE.md, and docs/GLOSSARY.md describe four narrow immutability exceptions including this render-only Markdown syntax class; ADR-034 changes only the outer fence delimiter around the existing reproducibility-spoke Markdown example; make site emits no citation-processor or fenced-div warning for the affected ADR; make audit passes after regenerating SUBMISSION_AUDIT.md.
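
The nested-fence rule behind this fix can be sketched as a small helper: the outer fence must be strictly longer than the longest backtick fence it contains. The function name and signature are hypothetical, and tilde fences are not handled.

```python
import re


def outer_fence(inner_text, min_len=3):
    """Return a backtick fence longer than any fence line inside inner_text."""
    longest = 0
    for line in inner_text.splitlines():
        # A fence line starts with up to three spaces, then three or more backticks.
        match = re.match(r" {0,3}(`{3,})", line)
        if match:
            longest = max(longest, len(match.group(1)))
    return "`" * max(min_len, longest + 1)
```

For an example that contains triple-backtick bash fences, this returns four backticks, which matches the single delimiter change applied to ADR-034.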

CLAIM-071 - ADR-071: Execute the slug-sweep authorized by ADR-067 + ADR-068 + ADR-069 that was never actually completed in any prior commit

Status: Accepted

Source: 2026-05-20 audit re-verification; ADR-067 §C1 canonical mapping table + post-2026-05-20 audit extension; user request to close the documented-but-unexecuted authorization loop.

Closing commit/ADR: 37c2b32

Claim

ADR-067 + ADR-068 + ADR-069 + ADR-070 (2026-05-19, the four-ADR immutability-relaxation chain) authorized four narrow classes of in-place editorial fixes to immutable ADRs: slug typos (Class A), broken external references (Class B), publisher-URL to DOI canonicalization (Class C), render-only markdown corrections (Class D). The Class A authorization specifically said the slug substitutions would land in v1.2.2 and v1.2.6 commits, with corresponding patterns deleted from .lycheeignore. The substitution commits never landed. As of 2026-05-20 audit re-verification, 59 broken decisions/ADR-NNN-<wrong-slug>.md references still ship across 21 ADR files; ADR-067’s own §C1 substitution table contains 14 live broken markdown links in its wrong-slug column; the .lycheeignore patterns ADR-067 §C3 said would be removed at v1.2.2 are still present. This ADR closes the loop: applies the full slug-mapping table derived from ADR-067 §C1 + the post-2026-05-20 audit extension (33 patterns total mapping to ~15 canonical files) across all 22 affected ADRs; strips the local-fs path leak in ADR-040 (/home/brandon_behring/.claude/plans/twinkly-weaving-puppy.md, 3 occurrences); removes the corresponding .lycheeignore patterns; and removes the 2 ADR-029 misattribution references (which cited a non-existent “immutability ADR-029” slug — actual ADR-029 is test-marker-strategy-four-marker-ratification) per ADR-067 §C2.5 “remove without replacement” rule.

Acceptance criterion

git grep "decisions/ADR-[0-9]\{3\}-[a-z0-9-]*\.md" decisions/ADR-*.md shows zero references to non-existent ADR files (all targets resolve to existing canonical filenames). The .lycheeignore no longer contains the broken-slug patterns ADR-067 §C3 promised to remove at v1.2.2. The /home/brandon_behring/.claude/plans/ path no longer appears in any committed file. CI lychee check passes on decisions/ with the smaller ignore set.
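
The first criterion lists references but does not by itself show which targets are missing. A minimal sketch of that second step follows, reusing the shape of the git grep pattern. The function name is hypothetical, and the caller is assumed to supply the set of filenames that actually exist in decisions/.

```python
import re

# Same shape as the git grep pattern above: decisions/ADR-NNN-<slug>.md
REF_RE = re.compile(r"decisions/(ADR-\d{3}-[a-z0-9-]*\.md)")


def broken_adr_refs(text, existing_filenames):
    """Return referenced ADR filenames that are absent, de-duplicated in order."""
    existing = set(existing_filenames)
    seen = set()
    broken = []
    for name in REF_RE.findall(text):
        if name not in existing and name not in seen:
            seen.add(name)
            broken.append(name)
    return broken
```

In a checkout, existing_filenames could come from os.listdir("decisions"). An empty result across every ADR file corresponds to the zero-broken-references criterion.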

CLAIM-072 - ADR-072: Backfill missing frontmatter fields + Alternatives sections + Status headings for ADR-051 + ADR-052 per ADR-067-style narrow-relaxation discipline

Status: Accepted

Source: 2026-05-20 audit re-verification (grep on v1.2.8 head confirmed empty closing_commit/transcript + missing sections); ADR-067-070 narrow-relaxation methodology applied to the frontmatter-backfill class.

Closing commit/ADR: 8105f37

Claim

ADR-051 + ADR-052 (both 2026-05-18 governance ADRs) ship with empty closing_commit: and transcript: frontmatter fields and no ## Alternatives Considered body section. ADR-051 also lacks a ## Status body section heading and opens with # ADR-051 — instead of # ADR-051: (template convention). The user’s 2026-05-18 self-audit (REPO_AUDIT_2026-05-18.md §P1.4) flagged ADR-049 + ADR-050 for the same class of gap, but those have since been populated (closing_commit 423c2c8 and 3b16036 respectively); the actual structural debt is in ADR-051 + ADR-052. This ADR treats the empty-frontmatter-fields gap as a further narrow-relaxation class adjacent to the four in ADR-067-070 (slug typos / broken external refs / publisher-URL canon / render markdown fixes): populates the missing closing_commit values from observable tag history (ADR-051 Block A closed at v1.0.9 via ADR-058; ADR-052 introduced + closed at v1.0.3), adds the retrospective ## Alternatives Considered sections documenting the decision space at lock time, restores the ## Status body heading on ADR-051, and fixes ADR-051’s opening line per template convention.

Acceptance criterion

head -50 decisions/ADR-051-*.md shows populated closing_commit: (value: v1.0.9 (Block A; Block B carryforward to v1.1.x)) and the body has ## Status, ## Alternatives Considered sections; the opening line reads # ADR-051: ... (colon, not em-dash). head -50 decisions/ADR-052-*.md shows populated closing_commit: (value: v1.0.3) and the body has ## Alternatives Considered. scripts/regenerate_audit.py --check passes after the backfill.
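
The head -50 inspection above could be mechanized along these lines. This is a sketch, not part of scripts/regenerate_audit.py: it assumes frontmatter delimited by --- lines with simple top-level key: value pairs, and the function name and default field lists are illustrative.

```python
def frontmatter_gaps(text,
                     required_fields=("closing_commit", "transcript"),
                     required_headings=("## Status", "## Alternatives Considered")):
    """Return a list of empty frontmatter fields and missing body headings."""
    lines = text.splitlines()
    fields = {}
    body_start = 0
    if lines and lines[0].strip() == "---":
        for i, line in enumerate(lines[1:], start=1):
            if line.strip() == "---":
                body_start = i + 1
                break
            # Only top-level "key: value" lines; nested items are skipped.
            if ":" in line and not line.startswith((" ", "-")):
                key, _, value = line.partition(":")
                fields[key.strip()] = value.strip()
    problems = [f"empty field: {name}" for name in required_fields
                if not fields.get(name)]
    body = [line.strip() for line in lines[body_start:]]
    problems += [f"missing heading: {h}" for h in required_headings
                 if h not in body]
    return problems
```

An empty list for both ADR-051 and ADR-052 would correspond to the backfill being complete. The title-line convention (colon after the ADR number) is not checked here.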

CLAIM-073 - ADR-073: Collapse the four-ADR narrow-relaxation chain (ADR-067/068/069/070) into a single consolidated immutability rule with four named exception classes

Status: Accepted

Source: 2026-05-20 audit hiring-manager-curious risk finding — 4-ADR chain in 1 day reads as process-fragility for a 5-day artifact. Consolidation reduces the optic without changing the underlying authorization.

Closing commit/ADR: ba342c7

Claim

The 2026-05-19 four-ADR immutability-relaxation chain (ADR-067/068/069/070) authorized four narrow classes of in-place editorial fixes to immutable ADRs. Each ADR in the chain insists it is “not a slippery slope”; the existence of four sequential relaxations in a single day undercuts that insistence from a reader-perception standpoint. This ADR consolidates the four narrow-relaxation classes into a single canonical immutability rule with named exception classes (A through D), reducing the visible “immutability is loose” surface from four signals to one rule + four named exceptions. ADRs 067/068/069/070 remain in decisions/ as historical artifacts documenting when each class was added; the prospective rule citation moves to ADR-073.

Acceptance criterion

decisions/README.md §Lifecycle (or equivalent) cites ADR-073 as the canonical immutability rule reference; ADRs 067-070 remain reachable but are no longer cited as the prospective rule. New contributors learn one rule with four classes instead of one rule + four patches. CLAUDE.md immutability section simplified to cite ADR-073.

CLAIM-074 - ADR-074: Replace verbatim self-criticism quote in ADR-064 frontmatter claim + body context with a neutral paraphrase preserving the decision

Status: Accepted

Source: 2026-05-20 audit hiring-manager-curious risk finding — the single most embarrassing string in decisions/ sits in a publicly rendered ADR. Verbatim quote preserved privately in the transcript file (gitignored, emailed separately at submission).

Closing commit/ADR: 14f0c05

Claim

ADR-064 (“writeup-hiring-manager-clarity-and-consistency-pass”) has a title slug that explicitly contains “hiring-manager-clarity” — a curious reviewer who searches the candidate’s profile + the keyword “hiring-manager” lands on this ADR. The frontmatter claim: field (lines 9-16) and body §A Context (lines 105-110) contain a verbatim user-feedback quote that includes the phrase “doesn’t demonstrate clear thought” — an embarrassing self-assessment string surviving in an immutable record that any hiring manager wandering into decisions/ will see. The decision context (what changed and why) survives intact without the verbatim quote. This ADR replaces the verbatim quote with a neutral paraphrase per ADR-073 Class A / Class B-adjacent narrow-relaxation discipline (no decision content changes; reader experience strictly improved).

Acceptance criterion

grep "doesn't demonstrate clear thought" decisions/ADR-064-*.md returns 0 matches. ADR-064’s claim + body context narrate the decision provenance without the verbatim self-criticism phrase. ADR-064’s actual decisions (B1-B6, C1-C2, D1, E1-E5) remain unchanged. The transcript file (private) preserves the original verbatim user feedback for audit-trail completeness.

CLAIM-075 - ADR-075: Unify the ADR-050 Revision 2 (FUSE-crash-forced-drop) and ADR-052 (methodology-load-bearing-with-crash-as-trigger) framings into a single account

Status: Accepted

Source: 2026-05-20 audit hiring-manager-curious finding — the ADR-050 → ADR-052 same-day reason-swap erodes reader trust in the rest of the ADR corpus. Unification removes the retcon optic while preserving both historical ADRs as artifacts.

Closing commit/ADR: 428971c

Claim

ADR-050 Revision 2 (2026-05-18 morning) attributed the Phase 5 full-FT OOD inference drop to an X11 FUSE EIO crash on /workspace MooseFS storage (operational forced-drop framing). ADR-052 (2026-05-18 hours later) narrowly superseded ADR-050 R2 to reframe the drop as methodologically load-bearing with the FUSE crash as proximate trigger. A skeptical reader reading both in sequence sees a same-day retcon of the load-bearing reason for cutting a planned experiment. This ADR consolidates the two framings into one prospective narrative that names both the methodological reasoning and the operational trigger together, removing the same-day retcon optic without changing the underlying outcome (full-FT OOD inference remains dropped; LODO comparison ships 3-rung; OOD ships 2 trained rungs + 1 classical floor + 2 reference scorers per ADR-050 R1 + R2 + ADR-052).

Acceptance criterion

WRITEUP.md §8.1 + WRITEUP/limitations-and-future-work.md §8.1 + WRITEUP/model-rungs.md §4.x cite ADR-075 (or ADR-050 R1 for the LLM-judge axis) as the single source of truth on the full-FT OOD drop rationale. Future reviewers see one coherent narrative rather than the ADR-050 R2 + ADR-052 sequence framed as competing rationales. ADR-050 R2 + ADR-052 remain in decisions/ as historical artifacts.

CLAIM-076 - ADR-076: Backfill superseded_by on ADR-046 + ADR-054 + ADR-061 (→ ADR-062) + closing_commit on ADR-071-075 per ADR-072 frontmatter-backfill discipline

Status: Accepted

Source: REPO_AUDIT_2026-05-21.md §P1-6 (ADR-062 missing superseded_by back-links) + §P2 (ADR-071-075 empty closing_commit). SHAs verified via git log --oneline --diff-filter=A -- "decisions/ADR-<NNN>-*.md" for the file-introduction commit of each ADR. ADR-072 precedent applied to the back-link + closing_commit-backfill axis.

Closing commit/ADR: Not recorded.

Claim

ADR-062 (Quarto writeup clarity + canonical figures, v1.2.0) declares supersedes: [ADR-046, ADR-054, ADR-061] in its frontmatter (line 29) but the inverse links are missing — ADR-046 has empty superseded_by:, ADR-054 has superseded_by: [], ADR-061 has superseded_by: []. The supersession-chain forward links are correct; the back-links are stale. Separately, ADR-071 through ADR-075 (all 2026-05-20 governance ADRs) ship with empty closing_commit: frontmatter fields though the ADR-072 precedent (closing_commit 8105f37) establishes that closing_commit population is governance-relevant audit-trail data. REPO_AUDIT_2026-05-21.md §P1-6 + §P2 (ADR-071-075 empty closing_commit) surface both gaps. This ADR treats the empty-back-link + empty-closing_commit gaps as the same frontmatter-backfill narrow-relaxation class established by ADR-072 (extending the ADR-067-070 chain). Populates: (a) ADR-046 + ADR-054 + ADR-061 superseded_by: ["062"] to mirror ADR-062’s forward declaration; (b) ADR-071-075 closing_commit: with the verified SHAs from git log (ADR-071 = 37c2b32, ADR-072 = 8105f37, ADR-073 = ba342c7, ADR-074 = 14f0c05, ADR-075 = 428971c). No decision content changes; only audit-trail completeness is restored.

Acceptance criterion

grep '^superseded_by' decisions/ADR-046-*.md decisions/ADR-054-*.md decisions/ADR-061-*.md shows three ["062"] values (was empty/[]). grep '^closing_commit' decisions/ADR-071-*.md decisions/ADR-072-*.md decisions/ADR-073-*.md decisions/ADR-074-*.md decisions/ADR-075-*.md shows five populated SHAs (was empty). scripts/regenerate_audit.py --check passes after the backfill with 76 CLAIM rows (ADR-076 added).
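
The back-link invariant can be stated compactly: every supersedes edge needs a matching superseded_by entry on its target. Below is a minimal sketch over already-parsed frontmatter. The dictionary inputs and the function name are assumptions, and this is not the logic of scripts/audit_superseded_by_backlinks.py.

```python
def missing_backlinks(supersedes, superseded_by):
    """Return (target, source) pairs where a supersedes edge has no back-link.

    supersedes:    {"062": ["046", "054", "061"]}  forward declarations
    superseded_by: {"046": [], "054": [], ...}     back-link fields per ADR
    """
    missing = []
    for source, targets in sorted(supersedes.items()):
        for target in targets:
            if source not in superseded_by.get(target, []):
                missing.append((target, source))
    return missing
```

With the pre-backfill state described in the claim, this yields three missing pairs, all pointing back to "062". After the backfill it returns an empty list.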

CLAIM-078 - ADR-078: Absorb EXECUTIVE_SUMMARY content into README; retire EXECUTIVE_SUMMARY.md as standalone file

Status: Accepted

Source: User feedback 2026-05-21 surfacing the “neither narrative nor academic structure — random parts of results all over the place” diagnosis; subsequent /exploring-options Q1.1 lock (“the executive summary can be on the readme?”) + Q1 main scope reduction from 3 guides to 2 guides + README absorption.

Closing commit/ADR: Not recorded.

Claim

The reading-guide architecture before v1.3.0 has 4 reader-facing distillation surfaces: index.qmd (60-sec landing) + README.md (repo-level orientation) + EXECUTIVE_SUMMARY.md (1-page summary) + WRITEUP.md (methodology hub), with the same content rendered in 3-4 framings on EXECUTIVE_SUMMARY + index + WRITEUP. The v1.3.0 restructure (see ADR-079) introduces 2 reader-style guides (WRITEUP_PAPER.md + WRITEUP_NARRATIVE.md) that require a clear separation between the depth-0 entry (README + index) and the depth-1 reading guides. To avoid maintaining 3+ overlapping distillation surfaces, EXECUTIVE_SUMMARY.md is retired and its content absorbed into README.md as the README’s top-fold §“Executive summary” section. Behavior changes: (a) README.md becomes the canonical 1-page distillation reachable both from the repo root + via direct link on the live site; (b) EXECUTIVE_SUMMARY.md is removed from _quarto.yml render allowlist + sidebar + navbar + cross-references; (c) the historical reviewer URL pin (tree/v1.0.0 per ADR-033) preserves the EXECUTIVE_SUMMARY.md file at submission time so external academic citations of EXECUTIVE_SUMMARY remain resolvable. Decision affects presentation surfaces only; no methodology / model / data / compute change.

Acceptance criterion

ls EXECUTIVE_SUMMARY.md 2>&1 returns “No such file or directory” on the v1.3.0 tip. grep '^## Executive summary' README.md returns 1+ match. _quarto.yml render allowlist has no EXECUTIVE_SUMMARY.md entry; navbar Reference dropdown has no “Executive summary” entry; sidebar Entry section has no EXECUTIVE_SUMMARY.md entry. decisions/audits/REPO_AUDIT_2026-05-21.md does NOT cite the retirement as a delta (it precedes v1.3.0). Audit scripts all exit 0 (incl. v1.2.14 audit_adr_count_claims after 77→78 cascade per ADR-078’s creation).

CLAIM-079 - ADR-079: Two-guide reader architecture (academic IMRAD + narrative arc) replacing the single-hybrid WRITEUP for v1.3.0

Status: Accepted

Source: User feedback 2026-05-21 “lets do one narrative one academic and the readme which has the executive summary. The academic and narrative should still cover everything, but the style can be different. Does that make any sense?”; locked via /exploring-options Q1 (3-guide initial scope) + Q1-revisit (reduced to 2 guides + “lets think for a second” deliberation) + 4-question follow-on slate (file naming + sub-PR strategy + authorship + voice register).

Closing commit/ADR: Not recorded.

Claim

User feedback 2026-05-21 diagnosed the existing single-guide architecture as “neither a narrative structure nor an academic structure like in a journal paper” with “random parts of results all over the place with no story”. The existing WRITEUP.md was a loose-narrative-with-numbered-sections hybrid, with methodology placed AFTER findings (WRITEUP §9 vs §7) and the same content rendered 3-4 times across index.qmd + EXECUTIVE_SUMMARY.md + WRITEUP.md + RESULTS.md in different framings. The v1.3.0 architectural response introduces two reader-style guides: WRITEUP_PAPER.md (academic IMRAD: Abstract / Introduction / Background / Methods / Results / Discussion / Limitations / Conclusion / References; formal academic register; passive voice; technical terminology with on-first-use definitions cross-referenced to docs/GLOSSARY.md) + WRITEUP_NARRATIVE.md (story arc: Hook / Setup / Investigation / Revelation / Other findings / Implications / Epilogue; plain-English first-person plural ‘we’ register; active voice; minimal jargon, defined on first use + cross-referenced to docs/GLOSSARY.md). Both guides cover the same content (problem, methodology, all 7 findings, mechanism, limitations); the register and structure differ. Each guide treats the headline finding natively per its style (academic = Finding 3 of 7 equal-weight; narrative = Act-3 dramatic revelation). Companion: ADR-078 absorbs EXECUTIVE_SUMMARY content into README. The current WRITEUP.md becomes a 1-page router stub directing readers to either guide; this preserves backward references from 8 WRITEUP/ spokes + ADRs without breaking links. index.qmd is rebuilt as a 60-sec hook + chooser. READING_GUIDE.md is rebuilt as a 2-path router. Decision affects presentation surfaces only; no methodology / model / data / compute change.

Acceptance criterion

ls WRITEUP_PAPER.md WRITEUP_NARRATIVE.md returns both files. head -3 WRITEUP_PAPER.md shows the academic IMRAD title + reader-note pointing at WRITEUP_NARRATIVE.md as the alternative. head -3 WRITEUP_NARRATIVE.md shows the narrative title + reader-note pointing at WRITEUP_PAPER.md. head -5 WRITEUP.md shows the stub-redirect router language (“Pick how you want to read this”). head -5 index.qmd shows the 60-sec hook with chooser between the two guides. head -5 READING_GUIDE.md shows the 2-path router. _quarto.yml navbar Methodology dropdown lists “Academic paper (IMRAD)” + “Narrative arc (story)” + “Router (pick a guide)”. _quarto.yml sidebar “Methodology guides (pick a style)” section lists the same. scripts/audit_adr_count_claims.py exits 0 (catches the 78→79 cascade across reader-facing surfaces). scripts/audit_superseded_by_backlinks.py exits 0 (ADR-079 → ADR-053+054+061 supersession edges correctly classified as axis-only via comment heuristic).

CLAIM-080 - ADR-080: Correct reviewer URL pin numeric defect in ADR-078 + ADR-079 (tree/v1.2.8 → tree/v1.0.0); axis-only supersession on the reviewer-URL-pin axis

Status: Accepted

Source: Audit performed 2026-05-22 against the live GH-Pages deployment + repo source. Live verification confirmed all three surfaces render the wrong tree/v1.2.8 claim. ADR-033 §C + CHANGELOG v1.3.0 entry confirm tree/v1.0.0 as the source-of-truth pin. Audit lock via /exploring-options Q1 (audit-fix + ADR-080 axis-only supersession). Per-commit fix-forward at release/v1.3.1 PR-1.

Closing commit/ADR: Not recorded.

Claim

ADR-033 (GitHub release strategy + reviewer URL pin) canonically pins the historical reviewer URL at tree/v1.0.0 (ADR-033 §C artifact-pin table line 113: “Canonical source pin | Never drifts”). CHANGELOG v1.3.0 confirms (“Reviewer URL pin tree/v1.0.0 unchanged per ADR-033”). Three reader-facing surfaces contradict that pin by asserting tree/v1.2.8 as the historical reviewer URL pin “per ADR-033”: WRITEUP.md:48, decisions/ADR-078:164, and decisions/ADR-079:228 + decisions/ADR-079:291. ADR-078 even self-contradicts within one paragraph (line 163: “the tree/v1.0.0 pin (per ADR-033)” vs line 164: “Reviewer URL pin is now tree/v1.2.8”). The v1.2.8 references are factually wrong — ADR-033 does not pin v1.2.8 anywhere; the historical content the cited links intend to surface (the old single-hybrid WRITEUP.md jumbled-state + the retired EXECUTIVE_SUMMARY.md file) IS preserved at tree/v1.0.0 per ADR-033’s never-drift discipline. The v1.2.8 tag exists in repo history (it shipped the Quarto navigation restructure per ADR-061) but it is not the ADR-033 reviewer pin, and citing it as such misleads any reader following the link. This ADR corrects the numeric defect on the reviewer-URL-pin axis only: WRITEUP.md (mutable) is edited in place; ADR-078 + ADR-079 bodies remain unchanged per CLAUDE.md immutability; their superseded_by: frontmatter is backfilled to ["080"] per the ADR-076 / ADR-077 frontmatter-backfill narrow-relaxation discipline (extended from ADR-072 precedent). All other content in ADR-078 + ADR-079 (the EXECUTIVE_SUMMARY absorption decision + the two-guide reader architecture decision; their alternatives considered, consequences, linked ADRs) stands as locked. This is the FIRST axis-only supersession on a factual numeric-correction axis distinct from the prior axis-only supersessions on reading-guide-architecture axes (ADR-076 through ADR-079).

Acceptance criterion

grep -n 'tree/v1.2.8' WRITEUP.md returns zero hits (after the WRITEUP.md edit in this PR). grep '^superseded_by' decisions/ADR-078-*.md + ... ADR-079-*.md both show ["080"]. decisions/ADR-080-*.md exists with supersedes: ["078", "079"] (axis-only comments). scripts/regenerate_audit.py --check passes after the ADR-080 landing (CLAIM count 79 → 80). scripts/audit_adr_count_claims.py exits 0 (catches 79→80 cascade across reader-facing surfaces; 6th correct firing of the v1.2.14 invariant; the same hit-set as the v1.3.0 cascade plus the surfaces v1.3.0 missed: CLAUDE.md + CHANGELOG header narrative + WRITEUP.md + WRITEUP/methodology-guarantees.md). ADR-078 + ADR-079 body content (numeric claims, alternatives, decision rationale, prose) unchanged.
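
The count-cascade check referred to above can be approximated in a few lines. This sketch is a guess at the general idea and is not the implementation of scripts/audit_adr_count_claims.py: the "<number> ADRs" phrase pattern and the function name are assumptions.

```python
import re

# Assumed phrasing of a count claim on a reader-facing surface, e.g. "80 ADRs".
COUNT_RE = re.compile(r"\b(\d+)\s+ADRs\b")


def stale_count_claims(text, expected):
    """Return (line_number, claimed_count) for every claim that differs from expected."""
    stale = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for claimed in COUNT_RE.findall(line):
            if int(claimed) != expected:
                stale.append((lineno, int(claimed)))
    return stale
```

Run with expected=80 over each reader-facing file, any surface still saying 79 would be reported with its line number.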

CLAIM-081 - ADR-081: Authorize frontmatter status: field split (verbose-context → pure-Nygard status: + new lifecycle-note:) as an extension of the ADR-072 / ADR-076 / ADR-077 frontmatter-backfill narrow-relaxation discipline; apply to ADR-060 as the seed case

Status: Accepted

Source: v1.3.2 multi-LLM audit cycle (Claude AUDIT_CLAUDE_2026-05-22 P3-3 finding) surfaced ADR-060’s non-Nygard status: field. Per /exploring-options 2026-05-22 Q2 lock (B1 — narrow-relaxation frontmatter edit, not in-place body change). The discipline matches the ADR-072 / ADR-076 / ADR-077 frontmatter-backfill chain — edit frontmatter, leave body untouched, document the precedent in a new ADR so future readers can see the audit trail.

Closing commit/ADR: Not recorded.

Claim

The v1.3.2 audit surfaced ADR-060’s frontmatter status: field as carrying non-Nygard verbose context: status: Accepted (methodology lock — infrastructure landed; execution deferred to v1.1.1 per /exploring-options 2026-05-19 Path B lock). The parenthetical context is load-bearing operationally (it explains why ADR-060 was marked Accepted at v1.1.0 even though execution didn’t land until v1.1.2 per ADR-063) but breaks the canonical Michael-Nygard status: <{Proposed, Accepted, Superseded, Deprecated, Rejected}> schema. ADR-060 is the only such ADR in the 80-ADR corpus (as of v1.3.1 close); future ADRs may similarly want to surface lifecycle-context that doesn’t fit pure Nygard. This ADR extends the existing frontmatter-backfill narrow-relaxation discipline (ADR-072 → ADR-076 → ADR-077) to a new field-split axis: any ADR whose status: is non-Nygard MAY have its frontmatter split in-place into pure-Nygard status: + a new lifecycle-note: field carrying the verbose context. No decision content changes; only frontmatter audit-trail metadata is restructured. The split preserves the prior author’s intent (the verbose context survives in lifecycle-note:) while restoring schema-compliance of status:. Applied to ADR-060 in the same patch (sole seed case as of v1.3.2):

- Before: status: Accepted (methodology lock — infrastructure landed; execution deferred to v1.1.1 per /exploring-options 2026-05-19 Path B lock)
- After: status: Accepted + new field lifecycle-note: methodology lock — infrastructure landed; execution deferred to v1.1.1 per /exploring-options 2026-05-19 Path B lock.

decisions/README.md schema documentation is updated to enumerate lifecycle-note: as an OPTIONAL frontmatter field. Cascades from this ADR’s creation:

- SUBMISSION_AUDIT.md CLAIM row count: 80 → 81.
- README + docs/for-hiring-managers + WRITEUP/methodology-guarantees + CLAUDE.md ADR-count claims: 80 → 81 (caught mechanically by v1.2.14’s audit_adr_count_claims.py invariant, proving its design intent on the 7th consecutive ADR-add).

Acceptance criterion

grep '^status:' decisions/ADR-060-*.md returns status: Accepted (pure Nygard). grep '^lifecycle-note:' decisions/ADR-060-*.md returns the previously-parenthesized text. decisions/README.md documents lifecycle-note: as an OPTIONAL frontmatter field. scripts/audit_adr_count_claims.py exits 0 (the v1.2.14 invariant catches reader-facing surfaces’ “80 ADRs” → “81 ADRs” requirement; this ADR’s creation cascades through). scripts/regenerate_audit.py --check passes after the split with 81 CLAIM rows. scripts/audit_superseded_by_backlinks.py exits 0 (ADR-081’s axis-only supersession of ADR-060 is correctly classified via comment heuristic).
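
The field split itself is mechanical and can be sketched as follows. The function name is hypothetical, and the parsing assumes the verbose context always sits in a single trailing parenthetical, as it does for ADR-060.

```python
import re

# Status values from the Nygard schema quoted in the claim.
NYGARD = {"Proposed", "Accepted", "Superseded", "Deprecated", "Rejected"}


def split_status(value):
    """Split 'Accepted (context)' into ('Accepted', 'context')."""
    match = re.fullmatch(r"\s*(\w+)\s*(?:\((.*)\))?\s*", value, flags=re.S)
    if not match or match.group(1) not in NYGARD:
        raise ValueError(f"not a Nygard status: {value!r}")
    status, note = match.group(1), match.group(2)
    # note becomes the lifecycle-note: value; None means no split is needed.
    return status, (note.strip() if note else None)
```

A value that is already pure Nygard returns None for the note, so running the split over the whole corpus would leave compliant ADRs untouched.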