# Audit findings ledger

This ledger tracks findings from each cross-model methodology audit (Gate 3 per
[ADR 0003](adr/) — to be drafted at v0.48). Each row records the finding ID,
severity, finding summary, disposition (how it was resolved or scheduled), and
a link to the tracked issue (where applicable).

**Convention**: blocker-severity findings get a `p1-gate3`-labelled GitHub issue
for fix-tracking. Lower-severity findings are recorded here only.

**Cross-references**:
- The audit prompt template is at
  [`gate3-audit-prompt.md`](https://github.com/brandon-behring/eval-toolkit/blob/main/.claude/plans/gate3-audit-prompt.md)
  (local — not in published docs).
- The v1.0 plan that drives audit cadence is at
  `~/.claude/plans/evaluate-all-the-work-twinkly-kite.md` (local).
- Audit re-run schedule: after each breaking minor (v0.46, v0.47, v0.48) plus
  the original Round 5 pre-implementation pass. 7-day audit-completion timeout
  per gate.

---

## Round 5 (2026-05-21) — Codex + Gemini pre-implementation audit

**Reviewers**: author (manual) + Codex (independent report) + Gemini
(independent report).

**Packet**: v0.44.0 code state + the v1.0 release plan
(`~/.claude/plans/evaluate-all-the-work-twinkly-kite.md`) +
`docs/source/methodology/` (16 chapters) + `docs/source/roadmap.md` +
`CHANGELOG.md` + `src/eval_toolkit/*.py` + existing migration guides
(`migration/v0.7.md`, `v0.8.md`, `v0.9.md`).

**Audit prompt**:
[`~/.claude/plans/gate3-audit-prompt.md`](https://github.com/brandon-behring/eval-toolkit/blob/main/.claude/plans/gate3-audit-prompt.md).

| ID    | Severity              | Finding                                                                                              | Disposition                                                                                            | Issue |
|-------|-----------------------|------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------|-------|
| R5-F1 | blocker before v0.46  | `scorecard()` first-party metric list mixes threshold-free + threshold-dependent metrics; `MetricSpec.compute(y, s) -> float` has no threshold input | **Decision R**: drop F1/accuracy/precision/recall from v0.46 specs; keep `metrics_at_threshold` + `ThresholdSelector` as separate path | _(in plan)_ |
| R5-F2 | blocker before v0.46  | `Scorecard` result type has no contract for undefined/skipped/error cells; existing `MetricState` infra unused | **Decision S**: reuse `MetricState` (`ok/skipped/error`) vocabulary from `artifacts.py:30-61`           | _(in plan)_ |
| R5-F3 | blocker before v0.47  | Sweep unification plan assumes `DelimitVariant`/`DatamarkVariant` classes exist; `preprocessing.py` exports only functions | Plan revised: create 3 preprocessing dataclasses at v0.47 as part of sweep work; sweep contract clarified (neutral default; explicit `attack_threshold` required for ASR) | _(in plan)_ |
| R5-F4 | blocker before v0.46  | Plan's scalar-deprecation directive would replace the lazy export resolver (`__init__.py:302-312`), shattering all root imports | Plan corrected: extend existing `__getattr__` with a transitional deprecation branch (BEGIN/END markers); v0.47 removes only that branch, base resolver intact | _(in plan)_ |
| R5-F5 | blocker before v1.0   | DeLong (`DeLongResult`, `delong_roc_variance`) is publicly exported + in API docs, but methodology + roadmap docs say "out of scope" | **Decision U**: keep public; update `methodology/comparison.md`, `methodology/reading_list.md`, `roadmap.md` to align with shipped state. Bundled into v0.48 §5E-prep | _(in plan)_ |
| R5-F6a | packet drift          | `cv_clt_ci` docstring claims "Bayle et al. 2020 prove a CV-CLT with a correction factor"; code does naive sample variance (no scalar correction) | Docstring fix bundled into v0.48 §5E-prep. Code is correct per Bayle (2020) Thm 3.1; docstring oversells | _(in plan)_ |
| R5-F6b | packet drift          | `methodology/parallelism.md:143-181` says "as of v0.34, harness not yet parallelized" + "once #29/#30 land" — but v0.36 wired `evaluate(n_jobs=)` + `evaluate_folded(n_jobs=)` | Doc fix bundled into v0.48 §5E-prep. Also clarify `bootstrap_ci`'s `n_jobs` is studentized-only       | _(in plan)_ |
| R5-F6c | packet drift          | `methodology/testing.md:108-136` says reference-equivalence gap "closing in PR 1.5"; roadmap state shows it's closed | Doc fix bundled into v0.48 §5E-prep                                                                    | _(in plan)_ |
| R5-F6d | packet drift          | `methodology/calibration.md:15-18` lists only temperature/isotonic/Platt; Beta + 4-binary-adapter family also shipped | Doc fix bundled into v0.48 §5E-prep                                                                    | _(in plan)_ |
| R5-F6e | packet drift + code   | `methodology/bootstrap.md` two-level example uses same array for val + test, causing ~63.2% overlap when `paired_bootstrap_op_point_diff` resamples val/test independently | Doc fix + defensive code guard bundled into v0.48 §5E-prep: rewrite example with disjoint split + raise `ValueError` in `paired_bootstrap_op_point_diff` if `val_y is test_y` | _(in plan)_ |
| R5-F6f | partial verified      | `CostSensitiveSelector` formula `t* = c_FP·(1-π) / (c_FP·(1-π) + c_FN·π)` is the prior-corrected form; existing docstring already cites Elkan's prior-independent variant — intentional design, but easy to misuse on already-calibrated probabilities | Docstring sharpening (add `Warning` admonition) bundled into v0.48 §5E-prep. Math itself is correct per the documented intent | _(in plan)_ |
| R5-F7 | governance risk       | Gate 3 redefinition (multi-model cross-review) is useful but not the same evidence class as external academic peer review; the plan understated this | **Decision O revised**: ADR 0003 explicitly states Gate 3 at v1.0 is internal model-assisted cross-review, NOT external academic peer review; documents what it catches and doesn't | _(in plan)_ |
| R5-Px | plan internal inconsistency | v0.46 verification smoke imports `TextTransform`, but Decision K introduces it at v0.47 | Plan smoke test corrected — `TextTransform` import removed from v0.46 smoke | _(in plan)_ |

**Round 5 totals**: 13 findings (5 blockers, 1 high-pre-v1, 5 packet drift, 1 governance, 1 plan inconsistency). All resolved or scheduled before any v0.46 code shipped.
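
The ~63.2% figure in R5-F6e is consistent with a standard bootstrap property: a with-replacement resample of `n` rows is expected to contain a fraction `1 - (1 - 1/n)^n` of the distinct rows, which tends to `1 - 1/e ≈ 0.632`. The sketch below is stdlib-only and illustrative; the function names are assumptions and are not part of `eval_toolkit`. It computes the closed form and checks it against a seeded simulation, which is why resampling the same array for both val and test guarantees heavy row sharing.

```python
import math
import random


def expected_unique_fraction(n: int) -> float:
    """Expected share of distinct rows in one bootstrap resample of size n."""
    return 1.0 - (1.0 - 1.0 / n) ** n


def simulated_unique_fraction(n: int, n_resamples: int, seed: int) -> float:
    """Monte Carlo estimate of the same quantity (illustrative helper)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_resamples):
        # Set of row indices drawn at least once in this resample.
        drawn = {rng.randrange(n) for _ in range(n)}
        total += len(drawn) / n
    return total / n_resamples


closed_form = expected_unique_fraction(1000)
simulated = simulated_unique_fraction(1000, n_resamples=200, seed=0)
limit = 1.0 - 1.0 / math.e  # large-n limit of the closed form
```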

**Key follow-on decisions** (driven by Round 5):
- **Decision W**: packet-drift fixes (F6a–F6f) bundle into v0.48 polish (not earlier hotfix).
- **Decision X**: scorecard skipped-status detection reuses `is_metric_defined_for_slice` from v0.39.0.
- **Decision Y**: Round 6 / 7 / 8 audits run as formal stop-gates between v0.46 / v0.47 / v0.48 / v1.0; 7-day timeout per gate.
- **Decision Z**: this ledger file.

---

## Round 6 (complete: 2026-05-21 — Codex + Gemini + manual; 11 findings)

**Ship date**: v0.46.0 tagged + published to PyPI 2026-05-21. STOP-GATE per
Decision Y.2 — `release/v0.47.0` opens once R6-F1 (BLOCKER) + R6-F2 (HIGH)
are fixed via v0.46.1 hotfix and the remaining items are scheduled.

**Reviewers**: author (manual) + Codex (independent) + Gemini (independent).

**Packet sent to Codex + Gemini**:

- The v1.0 plan file (`~/.claude/plans/evaluate-all-the-work-twinkly-kite.md`)
- This audit ledger (`docs/source/audit_findings.md`)
- `docs/source/methodology/` (16 chapters — unchanged from Round 5)
- `docs/source/roadmap.md`
- `docs/source/migration/v0.46.md` (**NEW** — consumer migration guide)
- `docs/source/adr/0002-scorecard-as-primary-metric-surface.md` (**NEW**)
- `CHANGELOG.md` (new v0.45.0 + v0.46.0 entries since Round 5)
- Source tree, with focus on the v0.45 + v0.46 additions:
  - `src/eval_toolkit/stacking.py` (v0.45, ~370 lines)
  - `src/eval_toolkit/scorecards.py` (renamed from `_scorecard.py` at v0.49.0; ~585 lines)
  - `src/eval_toolkit/metric_specs.py` (v0.46, ~180 lines)
  - `src/eval_toolkit/__init__.py` — `__getattr__` deprecation branch +
    `_DEPRECATED_SCALARS` (Audit F4 invariant region)
  - `src/eval_toolkit/metrics.py:120-200` —
    `SINGLE_CLASS_INCOMPATIBLE_METRICS` extension (Round-5 X.2 precondition)

**Audit prompt**:
[`~/.claude/plans/gate3-audit-prompt.md`](https://github.com/brandon-behring/eval-toolkit/blob/main/.claude/plans/gate3-audit-prompt.md)
(local). The "Known issues already in the v0.48 backlog (skip re-reporting)"
section already lists drift items scheduled for v0.48 polish — Round 6
reviewers should skip those and surface only NEW findings against the v0.46
state.

**Focus areas** for Round 6 review:

- **Scorecard surface design lock-in.** The Tier-2 `MetricSpec` Protocol
  freezes at v1.0 — method-signature changes require a v2.0 major bump.
  Last cheap chance to catch contract gaps.
- **MetricResult cell-state contract** — does the `ok` / `skipped` /
  `error` vocabulary cover every relevant failure mode? Are the reason
  strings useful for triage?
- **Per-cell error isolation** — confirm that catching all exceptions in
  `_evaluate_spec` doesn't hide important failures the user should see.
- **`__getattr__` deprecation shim** — Audit F4 invariant: does the branch
  route deprecated names correctly, leave non-deprecated resolution intact,
  and delete cleanly at v0.47?
- **Spec name encoding for parameterized metrics** — is
  `"ece_n_bins_15_strategy_uniform"` a stable v1.0 commitment, or does the
  alphabetize-kwargs rule create surprise keys for custom user specs with
  multi-kwarg signatures?
- **`Scorecard.to_pandas()` MultiIndex schema** — first-time-public; any
  shape lock-in concerns?
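
The `__getattr__` deprecation-shim focus area can be pictured with a minimal sketch of a module-level `__getattr__` (PEP 562) that carries a clearly delimited transitional branch in front of a base lazy resolver. Everything here is a hypothetical stand-in: `demo_pkg`, the contents of the two dictionaries, and the warning text are assumptions for illustration, not the toolkit's actual `__init__.py`.

```python
import types
import warnings

# Hypothetical stand-in for a package namespace; names are illustrative only.
pkg = types.ModuleType("demo_pkg")

_EXPORTS = {"scorecard": lambda: "scorecard-result"}  # base lazy exports
_DEPRECATED_SCALARS = {"old_metric": lambda: 0.5}     # transitional names


def _module_getattr(name: str):
    # BEGIN transitional deprecation branch (removable as one block)
    if name in _DEPRECATED_SCALARS:
        warnings.warn(
            f"{name} is deprecated; use scorecard() instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return _DEPRECATED_SCALARS[name]
    # END transitional deprecation branch
    if name in _EXPORTS:  # base resolver, untouched by the branch above
        return _EXPORTS[name]
    raise AttributeError(f"module {pkg.__name__!r} has no attribute {name!r}")


# PEP 562: the hook is looked up in the module's own namespace.
pkg.__getattr__ = _module_getattr
```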

**Triage on findings**: each blocker → `p1-gate3`-labelled GitHub issue +
a row in this ledger. Either fix-as-v0.46.1-hotfix or fold into v0.47
design (per Decision Q severity-tiered hotfix policy).

| ID | Reviewer | Severity | Finding | Disposition | Lands |
|----|----------|----------|---------|-------------|-------|
| R6-F1 | Codex | **BLOCKER** before v0.47 opens | `metric_specs.ece(strategy="typo")` silently dispatches to quantile ECE and returns scorecard cell with `status="ok"` under invalid key (`"ece_n_bins_15_strategy_typo"`). Wrong-by-design data correctness path. Verified via Codex runtime probe. | Add strategy validation in `ece()` factory + `_EceSpec.compute()`; raise `ValueError("ECE strategy must be 'uniform' or 'quantile'; got {strategy!r}")` (plan §2.5A). | **RESOLVED v0.46.1** (commit `7a4bb14`, tag `v0.46.1` 2026-05-21; consumer pin bumped same day) |
| R6-F2 | Codex + Gemini | HIGH before v0.47 scalar hard-removal | ECE deprecation warnings in `__init__.py:_scorecard_spec_for()` emit broken migration snippets for all 5 ECE variants. Two-part bug: (a) for the 2 variants in `metric_specs`, the suggested scorecard key uses the factory-call expression (`"ece(n_bins=10)"`) instead of the encoded spec name (`"ece_n_bins_10_strategy_uniform"`); (b) for the 3 variants NOT in `metric_specs` (`_debiased`, `_l2`, `_l2_debiased`), the fallback name isn't an importable spec. Gemini claimed pre-v0.46 default was `n_bins=15` (verified incorrect — code at `metrics.py:730-734` shows `n_bins=10`); Decision R6-F resolves: warning uses `n_bins=10` to preserve pre-v0.46 math + adds migration note about new factory default. | Restructure `_scorecard_spec_for()` to return `(factory_expr, scorecard_key, has_first_party)` tuple; correct snippets for first-party variants with `n_bins=10`; submodule-path template for 3 non-first-party variants per Decision R6-G (plan §2.5B). | **RESOLVED v0.46.1** (commit `7a4bb14`, tag `v0.46.1` 2026-05-21; consumer-side smoke verified all 5 ECE-variant warnings + submodule-path routing) |
| R6-F3 | Codex | HIGH before scorecard freeze | Duplicate `MetricSpec.name` values in the same `scorecard()` call silently overwrite earlier cells (last-wins). Not a documented contract. | Decision R6-B (locked): reject in `scorecard()` with `ValueError("Duplicate MetricSpec name 'X' at index N; ...")`. Forces caller to disambiguate; no silent data loss. (Plan §4G.) | **v0.47** |
| R6-F4 (= Gemini R6-F1) | Codex + Gemini | HIGH before v1.0 | `scorecard(seed=None)` documented as non-deterministic; implementation coerces `None → 0`. Doc/impl contradiction. Verified by Codex via bit-for-bit equality test. | Decision R6-A (locked): deterministic-by-default; fix docs only. No behavior change. Plan §4G-prep. (Decision R6-E: rolls to v0.47 — R6-A is non-blocker per Decision Q's "docstring" category.) | **v0.47** |
| R6-F5 | Codex | Contract-enforcement gap before v1.0 | ADR 0003 promises strict Tier-2 Protocol method-shape stability; current public-API drift guard only snapshots `(*args, **kwargs)` for Protocol classes, not method signatures. The guard does not see changes to `MetricSpec.compute`, `MetaLearner.fit`, etc. | Decision R6-D (locked): extend `tests/test_public_api.py` snapshot to capture Protocol method signatures via `inspect.signature` + `typing.get_type_hints` for the 9 Tier-2 Protocols. (Plan §4I.) | **v0.47** |
| R6-F6 | Codex | Packet drift | v1.0 plan + roadmap still describe pre-v0.46 scorecard shapes that didn't ship: `ece_n_bins_15` without strategy in plan, `ece_quantile()` factory listed (shipped as `ece(strategy='quantile')`), `MetricUndefinedError` mentioned (ADR 0002 chose no new public exception), `n_resamples >= 100` floor (shipped is `>= 1`). Roadmap "Currently shipped" still says v0.44. | Plan §4L: refresh plan §3A scorecard examples + roadmap shipped-state section. Doc-only commit on v0.47 release branch. | **v0.47** |
| R6-F3 (Gemini) | Gemini | MEDIUM (schema lock-in before v1.0) | `Scorecard.to_pandas()` MultiIndex columns expose `value, status, reason, ci_low, ci_high, confidence` but drop `n_resamples` + `method` from `BootstrapCI`. Provenance loss compared to `to_dict()`. v1.0 is about to lock the schema. | Decision R6-C (locked): add `n_resamples` + `method` columns at v0.47 (additive). Schema becomes lossless against `to_dict()`. (Plan §4H.) | **v0.47** |
| R6-F4 (Gemini) | Gemini | LOW | `MetricSpec` Protocol doesn't enforce stable parameterized-spec naming. Custom users implementing multi-kwarg parameterized specs can silently spawn distinct dict keys if constructor arg order varies. | Decision R6-H (locked): add `make_spec_name(prefix, **kwargs)` canonicalization helper in `metric_specs.__all__` only (NOT top-level `_EXPORTS` — Tier-2 additive contract). Alphabetized kwargs, snake_cased, joined by underscore. (Plan §4J.) | **v0.47** |
| R6-F5 (Gemini) | Gemini | LOW | `_evaluate_spec()` wraps `spec.compute()` in broad `except Exception`. Swallows `MemoryError`, `RecursionError`, `KeyboardInterrupt`, `SystemExit` into cell state — process exhaustion / user-interrupt signals get hidden as metric errors. | Narrow exception catch: `except (MemoryError, RecursionError, KeyboardInterrupt, SystemExit): raise` first, then existing broad catch. (Plan §4K.) | **v0.47** |

**Round 6 totals**: 11 findings (Codex 6 + Gemini 5; 2 overlap on `seed=None` + ECE deprecation snippets but with different reasoning angles). 1 BLOCKER (R6-F1) + 5 HIGH + 2 MEDIUM/contract + 3 LOW. All dispositioned to either v0.46.1 (2 fixes) or v0.47 (9 fixes).

**Key follow-on decisions** (driven by Round 6 — locked in plan):

- **Decision R6-A**: `seed=None` deterministic-by-default; fix docs only.
- **Decision R6-B**: Reject duplicate `MetricSpec.name` with `ValueError`.
- **Decision R6-C**: Add `n_resamples` + `method` to `to_pandas()` schema.
- **Decision R6-D**: Extend public-API snapshot to cover Protocol method signatures.
- **Decision R6-E**: v0.46.1 scope = R6-F1 + R6-F2 only; R6-A rolls to v0.47 (non-blocker per Decision Q's "docstring" category).
- **Decision R6-F**: Use `n_bins=10` (pre-v0.46 default) in deprecation warnings + migration note about new v0.46+ factory default of `n_bins=15`. Corrects Gemini's misverified pre-v0.46 default claim.
- **Decision R6-G**: 3 ECE variants without `metric_specs` (debiased, l2, l2_debiased) route deprecation warnings to submodule path; do NOT add to `metric_specs` at v0.47.
- **Decision R6-H**: `make_spec_name()` helper in `metric_specs` submodule only; not top-level.
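
A rough sketch of what a canonicalization helper in the spirit of Decision R6-H might look like, assuming the rule "alphabetized kwargs, joined by underscore" from the R6-F4 (Gemini) disposition. The body is an assumption, in particular how values are normalized, and the shipped `make_spec_name` may differ. The point is that keyword order no longer affects the resulting key.

```python
import re


def make_spec_name(prefix: str, **kwargs: object) -> str:
    """Hypothetical canonicalizer: prefix, then key/value pairs sorted by key."""
    parts = [prefix]
    for key in sorted(kwargs):
        parts.append(key)
        # Assumption: lowercase the value and collapse other characters to "_".
        value = re.sub(r"[^0-9a-z]+", "_", str(kwargs[key]).lower()).strip("_")
        parts.append(value)
    return "_".join(parts)


name_a = make_spec_name("ece", n_bins=15, strategy="uniform")
name_b = make_spec_name("ece", strategy="uniform", n_bins=15)  # same key
```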

### Round 6 v0.46.1 ship status (2026-05-21)

- **R6-F1** ✅ SHIPPED in v0.46.1 (PR #67, squash `7a4bb14`). End-to-end verified
  in consumer: `ms.ece(strategy="typo")` raises `ValueError`; direct
  `_EceSpec(strategy=...)` construction also raises (defence-in-depth).
- **R6-F2** ✅ SHIPPED in v0.46.1 (PR #67, squash `7a4bb14`). End-to-end verified
  in consumer: `eval_toolkit.expected_calibration_error` warning carries
  `ece(n_bins=10)` + key `ece_n_bins_10_strategy_uniform` + migration note about
  v0.46+ `n_bins=15` default. All 3 non-first-party variants (debiased, l2,
  l2_debiased) route to `from eval_toolkit.metrics import …` submodule path.
- **9 other Round 6 items** (R6-A docstring, R6-B duplicate-name guard, R6-C
  to_pandas schema, R6-D Protocol method-shape snapshot, R6-F4-Gemini
  `make_spec_name`, R6-F5-Gemini narrow `except`, R6-F6 plan/roadmap state-drift)
  → folded into `release/v0.47.0` per Decision R6-E.
- **Round 6 STOP-GATE status**: CLOSED. `release/v0.47.0` can open after the
  v0.46.1 consumer cycle observation completes (1 cycle).
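
The defence-in-depth shape described for R6-F1 (validation both in the factory and on direct construction) can be sketched as follows. `EceSpecSketch` is an illustrative stand-in, not the real `_EceSpec`. Only the error message and the `n_bins=15` factory default are taken from this ledger; the rest is assumed.

```python
from dataclasses import dataclass

_VALID_STRATEGIES = ("uniform", "quantile")


def _validate_strategy(strategy: str) -> None:
    if strategy not in _VALID_STRATEGIES:
        raise ValueError(
            f"ECE strategy must be 'uniform' or 'quantile'; got {strategy!r}"
        )


@dataclass(frozen=True)
class EceSpecSketch:
    """Illustrative stand-in for a parameterized spec."""

    n_bins: int = 15
    strategy: str = "uniform"

    def __post_init__(self) -> None:
        _validate_strategy(self.strategy)  # guards direct construction

    @property
    def name(self) -> str:
        return f"ece_n_bins_{self.n_bins}_strategy_{self.strategy}"


def ece(n_bins: int = 15, strategy: str = "uniform") -> EceSpecSketch:
    _validate_strategy(strategy)  # guards the factory path
    return EceSpecSketch(n_bins=n_bins, strategy=strategy)
```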

---

## Round 7 (complete: 2026-05-21 — Codex + Gemini; 3 substantive findings)

**Reviewers**: author (manual) + Codex (independent report) + Gemini
(independent report).

**Packet**: v0.47.0 code state + the v1.0 plan + `docs/source/methodology/`
(16 chapters) + ADRs 0001/0002/0003 + `docs/source/migration/v0.46.md` +
`docs/source/migration/v0.47.md` + Round 5/6 ledger.

**Round-7 briefing**: `gate3-audit-round-7.md` (committed `a9e1114`).

**Reports**: `gate3-audit-round-7-codex-report.md` + `gate3-audit-round-7-gemini-report.md` (untracked per `.gitignore`).

**Headline**: Codex 3 substantive findings; Gemini 0. Overlap was zero
between the two reports — the most consequential finding (R7-F1
doc-migration boundary gap between Sybil-tested fences and
MyST-NB-executed example notebooks) was Codex-only. Reinforces the
Round 6 pattern (do not use overlap as a confidence floor; single-reviewer
findings can be the most critical).

| ID | Reviewer | Severity (their words) | Finding | Disposition | Lands |
|----|----------|------------------------|---------|-------------|-------|
| R7-F1 | Codex | high before v0.48 | v0.47 doc migration missed MyST-NB executable example notebooks (separate from Sybil-collected `.md` fences). 6 example pages + 4 module-level docstrings + `protocols.md` autosummary + roadmap wording still reference removed APIs. Docs CI runs `sphinx-build` without `-W`, so notebook execution failures pass as advisory warnings. Verified via `sphinx-build` runtime probe — 6 execution failures buried in the warning stream. | Decision R7-A (locked at /exploring-options Q3): bundle into v0.48 §5G/§5H. §5G migrates the 6 notebooks + 4 docstrings + autosummary + roadmap; §5H enables `nb_execution_raise_on_error = True` in `conf.py`. Audit-as-seed expansion (Q2 locked full sweep) covered ALL module docstrings + drift in 5 existing `api/*.md` autosummary lists + 8 missing `api/*.md` pages. | **RESOLVED v0.48.0** (§5G commit `e07db16` + §5H commit `6349472` on `release/v0.48.0`) |
| R7-F2 | Codex | high before sweep freezes | `sweep()` records only `strategy.name` per row; two configured instances of same dataclass (e.g., `DelimitVariant(delimiter="<<")` + `DelimitVariant(delimiter="[[")`) silently merge under `groupby("variant")`. Style-coherent defect class with Round 6 R6-F3 (scorecard duplicate name) but with different semantics (row container vs. Mapping). | Decision R7-B option C (locked): emit `strategy_id` canonical column AND reject duplicate `strategy_id` at sweep boundary. Style invariants 1 (no silent failures) + 2 (natural call pattern is right) + 4 (canonical identifier + reject in canonical dimension) read together. | **RESOLVED v0.48.0** (§5I commit `f454afe`) |
| R7-F3 | Codex | worth fixing before v1.0 | `sweep()` doesn't validate scorer output cardinality. Three failure modes via runtime probe: overlong 1-D → silent truncation (worst); short 1-D → IndexError later; (n,2) matrix → TypeError when `float()` applied. | Decision R7-C (locked): API-level `ValueError` with contextual label at the sweep boundary; replaces all three low-level failure modes. Style invariants 1 + 3. | **RESOLVED v0.48.0** (§5J commit `fcf99f0`) |

### Gemini observations (Round 7)

Gemini's report verdict was "highly stable; release/v0.48.0 is safe to open." Six minor observations / validations; nothing critical that Codex hadn't covered. The actionable items folded into v0.48:

- §1-3 + 5-7: VALIDATIONS of v0.47 shipped state (`TextTransform` shape, shim removal, sweep design, R6-D Protocol method-shape snapshot, ADR 0003 tiers). No action needed.
- §4 (pedagogical drift): Gemini noted `from eval_toolkit.metrics import pr_auc` is syntactically green but slightly undermines ADR 0002. v0.48 §5G migration explicitly chose `scorecard()` for example notebooks teaching METRIC USAGE; submodule path only where teaching the underlying math.
- §4 (Makefile pre-push): Gemini recommended hardening to prevent the `pytest tests/` path-override trap. Landed as v0.48 §5L (`make pre-push` target running all 3 doc-execution surfaces; commit `9878a54`).
- §5 (R6-C dtype coercion): Gemini noted `n_resamples` (int + NaN) → `float64` is an accepted tradeoff. Landed as v0.48 §5K (Notes section on `Scorecard.to_pandas()` docstring; commit `6304cea`).
- §6 (SynonymSubstitution whitelist): Gemini recommended adding a docstring note about the hardcoded 6-entry whitelist. Landed as v0.48 §5K (`adversarial.py` Notes section; commit `6304cea`).

### Audit-as-seed extensions (v0.48)

Per user direction during plan refinement ("use the audits as seeds for things to reconsider"), the Round 7 findings + style-invariants framing surfaced additional v0.48 scope beyond Codex's explicit list:

- **§5G expansion**: from 4 Codex-flagged module docstrings to full sweep across `src/eval_toolkit/` module docstrings + audit of all `docs/source/api/*.md` autosummary pages. Found 8 missing API pages + 5 drifted autosummary lists.
- **§5M new**: in-source docstring drift audit (third doc-execution surface). Result: 82 PASS / 1 skipped / 0 fail; expanded `.doctest-modules` from 11 → 21 modules so CI catches future drift.
- **§5N comprehensive**: cross-API shape-validation consistency sweep beyond Codex's R7-F3 target. Audited `metrics_at_threshold`, `paired_bootstrap_op_point_diff`, `bootstrap_metric_from_predictions`, `metrics.py` scalars, `fit_*_binary` / `fit_*_calibrator`. Tightening commit landed for `metrics_at_threshold` silent threshold semantics (commit `76773dc`); `paired_bootstrap_op_point_diff` `val_y is test_y` guard landed as part of §5E-prep code-side fix (commit `5c8e68d`).
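
The shape-validation theme behind R7-F3 and §5N can be illustrated with a boundary check that turns all three failure modes (overlong, short, and nested output) into one labelled `ValueError`. This is a stdlib sketch under assumptions: `validate_scorer_output` and its message wording are hypothetical, and the real `sweep()` check presumably operates on arrays rather than plain sequences.

```python
from numbers import Real
from typing import Sequence


def validate_scorer_output(
    scores: Sequence[object], n_rows: int, label: str
) -> list[float]:
    """Hypothetical boundary check: exactly one real-valued score per row."""
    if len(scores) != n_rows:
        # Covers both the overlong (silent truncation) and short cases.
        raise ValueError(
            f"{label}: scorer returned {len(scores)} scores for {n_rows} rows"
        )
    for i, s in enumerate(scores):
        if not isinstance(s, Real):
            # Covers nested output such as an (n, 2) matrix of rows.
            raise ValueError(
                f"{label}: score at index {i} is not a scalar number: {s!r}"
            )
    return [float(s) for s in scores]


ok = validate_scorer_output([0.1, 0.9], n_rows=2, label="variant=delimit")
```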

### Round 7 ship status

- **3 substantive Codex findings**: all RESOLVED in v0.48.0 via §5G + §5H + §5I + §5J.
- **6 Gemini observations**: all RESOLVED in v0.48.0 via §5G + §5K + §5L.
- **Audit-as-seed extensions** (§5G expanded, §5M new, §5N comprehensive): all RESOLVED in v0.48.0.
- **Round 7 STOP-GATE status**: CLOSED via v0.48.0 release. Round 8 audit STOP-GATE per Decision Y.2 opens against the v0.48.0 state before `v1.0.0` tag can land.
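
To make the R7-F2 resolution concrete, here is a minimal sketch of a canonical `strategy_id` plus a duplicate check at the sweep boundary. The dataclass, the ID format, and the helper names are all assumptions for illustration; only the idea of "emit a canonical identifier and reject duplicates" comes from the ledger.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DelimitVariantSketch:
    """Illustrative stand-in for a configured preprocessing strategy."""

    delimiter: str
    name: str = "delimit"


def strategy_id(strategy) -> str:
    """Hypothetical canonical ID: name plus parameters sorted by field name."""
    fields = asdict(strategy)
    fields.pop("name")
    params = ",".join(f"{key}={fields[key]!r}" for key in sorted(fields))
    return f"{strategy.name}({params})"


def check_unique_strategy_ids(strategies) -> list[str]:
    ids = [strategy_id(s) for s in strategies]
    seen = set()
    for index, sid in enumerate(ids):
        if sid in seen:
            raise ValueError(f"Duplicate strategy_id {sid!r} at index {index}")
        seen.add(sid)
    return ids


ids = check_unique_strategy_ids(
    [DelimitVariantSketch(delimiter="<<"), DelimitVariantSketch(delimiter="[[")]
)
```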

---

## Round 8 (2026-05-24) — STOP-GATE before v1.0 tag, rectified in v0.51.0

Multi-LLM cross-review fired against the v0.50.0 state. **Codex** returned
10 findings (3 with reproducible probes: C3, C4(a)/(b), C9); **Gemini**
returned 5 findings + 3 positive validations (V1 style, V2 methodology,
V3 scorer_error safety). Independent verification by Claude
(`audit-verification-codex-gemini-v0.50.0.md`, 510 lines, untracked
per `.gitignore`) confirmed 13 of 18 items, refuted 3 (R8-G2, R8-G5,
R8-V1+R8-V2 paired-as-one over-confident Gemini validations), and
deferred 2 (R8-G3, R8-G4) to v1.x as Tier-2 additive.
The v0.51.0 release ships fixes for all 13 confirmed items; the 3 refuted
entries are recorded below for the audit trail.

Round 9 multi-LLM cross-review runs against the v0.51 RC before v1.0 is
tagged.

### Confirmed (13) — fixed in v0.51.0

| ID | Severity | Finding | Disposition | Commit |
|----|----------|---------|-------------|--------|
| R8-C3 | blocker (escalated from Codex fix-rec) | `recall_at_fpr` fallback violated FPR ceiling (`actual_fpr=1.0` when `target_fpr=0.0`); probe `y=[0,1], scores=[1.0,1.0]`. | v0.51 sentinel `threshold=np.inf, actual_fpr=0.0, fp=0`. | `672d45f` |
| R8-C4(a) | blocker (escalated) | Generator-rng not parallel-stable in `_score_all_slices`; n_jobs=1 vs n_jobs=2 produced different CIs; SPEC 7 contract violation. | v0.51 spawns child SeedSequences at the dispatch boundary. | `61964f6` |
| R8-C4(b) | blocker (escalated) | `spawn_seed_sequences` ignored Generator state (extracted `bit_generator.seed_seq`). | v0.51 draws fresh entropy via `rng.integers(...)`. | `87453f6` |
| R8-C1 | fix-recommended | `evaluate_folded(seeds=)` reused the splitter across the seed loop — partitions never varied. | v0.51 adds `reseed_splitter` callback + DeprecationWarning when multi-seed + None. Warning persists past v1.0. | `f60d43c` |
| R8-C2 | fix-recommended | `SourceDisjointKFoldSplitter.iter_folds` yielded empty test partitions when k > n_sources; `get_n_splits` returned `min(k, n_sources)`. | v0.51 caps iter_folds at `min(k, n_sources)` + UserWarning. | `f60d43c` |
| R8-C5 | fix-recommended | README links to `docs/...` (broken); migration toctree listed only v0.7–v0.9 despite v0.49–v0.51 BREAKING releases. | v0.51 repoints all README links to `docs/source/...`; toctree + MIGRATION.md index extended through v0.51; v0.49/v0.50/v0.51 migration guides authored. | `c206b54` |
| R8-C6 | fix-recommended | `calibration.reliability_curve`, `maximum_calibration_error` did not validate `y_score ∈ [0,1]`; `fit_temperature` did not validate `bounds`. | v0.51 Tier-2 additive validation matching `metrics.py`-side ECE rigor. | `4c5e140` |
| R8-C7 | fix-recommended | Tests covered counts where semantics matter: `test_harness_folded.py:92-107` no partition-content check; `test_harness_parallelism.py:77-122` no Generator-rng; `test_protocol_conformance.py:228-240` no k>n_sources. | v0.51 adds semantic-property regression tests bundled with C1/C2/C4 fixes. | `87453f6` + `61964f6` + `f60d43c` |
| R8-C8 | minor | `SimilarityStrategy` listed as Tier-2 in README + `extending.md` but absent from `strict_tier2_protocols.md`'s 9-strict list. | v0.51 demotes SimilarityStrategy to "pre-v0.7 internal interface" in README + extending.md (aligns to v1.0 contract per ADR 0003). | `c206b54` |
| R8-C9 | minor | `GateResult.to_dict()` returned numpy + NaN unchanged; `json.dumps(..., allow_nan=False)` raised TypeError on the result. | v0.51 docstring documents the JSON-safety contract — strict-JSON requires `artifacts.write_json_strict` or `sanitize_for_json`. | `c206b54` |
| R8-C10 | minor | `.gitignore` patterns covered only `gate3-audit-*-report.md`; new `codex-comprehensive-audit-*-report.md` + `audit-verification-*.md` were tracked by default. | v0.51 `.gitignore` extended with new patterns. | `c206b54` |
| R8-F1 | fix-recommended (Claude — missed-by-both) | `losses.RecallAtLowFPR.__init__` did not validate `pos_weight > 0`; non-positive values produced degenerate-but-bounded loss values silently. | v0.51 eager validation matching sibling-kwarg pattern. | `4c5e140` |
| R8-F2 | minor (Claude — missed-by-both) | `metric_specs.ece(n_bins=)` validated `strategy` eagerly but deferred `n_bins` to compute time. | v0.51 eager `_validate_n_bins` call at factory level. | `4c5e140` |
| R8-F3 | minor (Claude — missed-by-both) | `analysis.CsvPredictionReader.read_predictions` silently filled missing CSV columns with empty strings → cryptic dtype error downstream. | v0.51 detects missing columns at read time; raises actionable `ValueError`. | `4c5e140` |

### Refuted (3) — recorded for audit trail; NO fix shipped

| ID | Severity (as auditor marked it) | Finding (rejected) | Rationale |
|----|---------|---------|-----------|
| R8-G2 | fix-recommended (Gemini) | "Cyclic import leakage→harness identified but left unresolved." | REFUTED: `harness.py:709-712` resolves the cycle via TYPE_CHECKING + lazy `noqa: PLC0415` import; comment explicitly states the resolution. Cycle is deliberately broken by an asymmetric-by-design pattern, not unresolved. |
| R8-G5 | minor (Gemini) | "Plotting tests use weak assertions (`assert fig is not None`)." | REFUTED: `tests/test_plotting_edge.py:125-140` does have that weak assertion, but the same file has structural assertions (line 92 `assert out is fig`; lines 220, 230 `.get_title()` equality). Cherry-picked; not representative. |
| R8-V1 | "Style + consistency: Exceptional" (Gemini) | Validation that the axis is exceptional. | REFUTED: Gemini's specific tooling claim (Ruff + Black + Mypy strict compliance) is true, but the axis is broader. Codex C1 + C5 + C6 + C7 + C8 + R8-G3 confirm substantive cross-API / docs / contract consistency drifts. Tooling-clean ≠ axis-exceptional. |
| R8-V2 | "Domain methodology: Masterclass" (Gemini) | Validation that methodology is masterclass. | REFUTED: Codex C1 + C3 + C4 + C6 confirm methodology bugs at fix-recommended-or-blocker severity. Methodology has real flaws; "Masterclass" overstates. Gemini's evidence (research-grounded tests exist) conflates research-citation discipline with methodology correctness. |
| R8-V3 | "scorer_error catches predictions safely" (Gemini) | Validation that the mechanism is safe. | CONFIRMED, recorded for completeness: `harness.py:468-474` correctly re-raises `MemoryError` + `AssertionError` BEFORE the broad `except Exception`, so critical exceptions escape the silent-capture path. |

### Deferred to v1.x (2)

| ID | Severity | Finding | Disposition |
|----|----------|---------|-------------|
| R8-G3 | fix-recommended (Gemini) | Codebase uses `ValueError` for every error condition (295 raises in src/; 0 custom exception classes). | Deferred to v1.x as Tier-2 additive. Custom exception hierarchy subclassing ValueError preserves catch-compatibility; adding it post-v1.0 is non-breaking. Reconsider when a downstream consumer requests it. |
| R8-G4 | minor (Gemini) | joblib OOM hazard documented at `_parallel.py:55-59` but not structurally mitigated (no memory-aware `n_jobs` capping in `harness.evaluate`). | Deferred to v1.x. Memory-aware capping needs RAM-measurement + dataframe-size accounting — non-trivial, no clear best-practice. Caller is responsible for `n_jobs` sizing under their RAM budget at v1.0. |

### Round 8 ship status

- **13 confirmed findings**: all RESOLVED in v0.51.0 via the
  per-finding commits above (Phase 1 + 2 + 3 + 4 on `release/v0.51.0`).
- **3 refuted findings** (R8-G2 cyclic-import framing, R8-G5
  cherry-picked weak test, R8-V1+R8-V2 over-confident Gemini
  validations paired-as-one): recorded above; no shipped fix needed.
- **2 deferred findings** (R8-G3 custom exceptions, R8-G4 joblib OOM
  capping): Tier-2 additive; v1.x or later.
- **Tally**: 13 + 3 + 2 = 18 total (canonical; aligned with
  CHANGELOG.md v0.51 section + migration/v0.51.md per RC4 reconciliation).
- **Round 8 STOP-GATE status**: CLOSED via v0.51.0 ship. **Round 9 audit
  STOP-GATE** per Decision Y.2 opens against the v0.51.0 RC before
  `v1.0.0` tag can land.
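
The R8-C3 probe is small enough to replay by hand. The sketch below is a simplified, stdlib-only stand-in for `recall_at_fpr` and not the shipped implementation; the threshold search and the return layout are assumptions. It shows why tied scores make every finite threshold violate `target_fpr=0.0`, and how an infinite-threshold sentinel keeps the FPR ceiling intact (`math.inf` is used here in place of `np.inf`).

```python
import math


def recall_at_fpr_sketch(y, scores, target_fpr):
    """Illustrative only: best recall among thresholds with FPR <= target_fpr."""
    n_pos = sum(1 for label in y if label == 1)
    n_neg = len(y) - n_pos
    best = None
    for threshold in sorted(set(scores), reverse=True):
        # Predict positive when score >= threshold.
        tp = sum(1 for lab, s in zip(y, scores) if s >= threshold and lab == 1)
        fp = sum(1 for lab, s in zip(y, scores) if s >= threshold and lab == 0)
        fpr = fp / n_neg if n_neg else 0.0
        recall = tp / n_pos if n_pos else 0.0
        if fpr <= target_fpr and (best is None or recall > best["recall"]):
            best = {"threshold": threshold, "recall": recall,
                    "actual_fpr": fpr, "fp": fp}
    if best is None:
        # No finite threshold respects the ceiling: predict nothing positive.
        best = {"threshold": math.inf, "recall": 0.0, "actual_fpr": 0.0, "fp": 0}
    return best


result = recall_at_fpr_sketch(y=[0, 1], scores=[1.0, 1.0], target_fpr=0.0)
```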

---

## Round 9 (2026-05-24) — STOP-GATE before v1.0 tag

Round 9 multi-LLM cross-review fired against the `release/v0.51.0` RC
(7 commits at the time of audit; PR #75 draft, CI green). **Codex**
returned 4 substantive findings (RC1-RC4) + self-validation worklog
(69/69 v0.51 regression tests pass); explicitly skipped the 10 modules
neither Round-8 auditor cited + 4 cross-cutting hunts. **Gemini** returned
6 items (RG1-RG6): 3 design challenges (2 escalating already-locked v0.51
decisions to v1.0 blockers) + 3 positive validations; zero probes, zero
line citations, ~45% validation density. Independent verification by
Claude (`audit-verification-round-9-v0.51.0.md`, 228 lines; gitignored
per R8-C10) confirmed **6 / refuted 3 / partial 1** of the 10 source-report
items, AND surfaced **3 third-audit findings + 3 minors** by hunting the
modules Codex skipped. Of those, F-sweep-1 was a CANDIDATE v1.0 BLOCKER
(R7-C "no silent failures" invariant violation on NaN scorer output)
that neither Round-8 nor Round-9 auditor caught.

Per the Path 3 lock from the post-R9 planning session, **two third-audit
findings (F-sweep-1 + F-bootstrap-1) ship fixed in this PR**; RC1 was
already fixed in this PR at `4c43771`; the remaining items defer to v1.0.1.

### Confirmed (6) — disposition mapped

| ID | Severity | Finding | Disposition | Commit / Defer-to |
|----|----------|---------|-------------|--------------------|
| R9-RC1 | fix-recommended (Codex) | README.md:208 has 2 broken `docs/extending.md` + `docs/examples/claims_and_gates.md` hyperlinks — R8-C5 regression (only the API-surface table-cell links; other R8-C5 sites are correct). | Fixed in PR. | `4c43771` |
| R9-RC2 | fix-recommended (Codex) | `SimilarityStrategy` contract mismatch: `__init__.py:294` exports it + `snapshot.json:1103-1117` pins `protocol_methods` (de facto strict Tier-2 STRICT contract) BUT `extending.md:18` + `strict_tier2_protocols.md` 9-strict list omit it. Snapshot IS the v1.0 contract regardless of docs. | Deferred to v1.0.1. Two paths: add to strict-Tier-2 list (10 strict + 1 opt-in; ADR 0003 update) OR remove from `__all__` + snapshot. | v1.0.1 |
| R9-RC3 | fix-recommended (Codex) | `test_evaluate_folded_reseed_splitter_varies_partitions` at `tests/test_harness_folded.py:134-164` only asserts key existence + count; no row-content comparison across seeds. Docstring claims partitions differ but assertions don't check it. Directly REFUTES Gemini RG6. | Deferred to v1.0.1 test hardening. | v1.0.1 |
| R9-RC4 | minor (Codex) | "3 refuted" counts categories (V1+V2 paired as one) while 4 named items listed; defensible across all 4 surfaces but invites confusion. | Deferred to v1.0.1 docs polish. | v1.0.1 |
| R9-RG4 | validation (Gemini) | metrics.py `SINGLE_CLASS_INCOMPATIBLE_METRICS` pattern is "elegant". | Validation accurate; pattern at metrics.py:35-99 is well-designed. | — |
| R9-RG5 | validation (Gemini) | bootstrap.py defaults match "industry standards". | Validation accurate: `bootstrap.py:66-69` shows N=1000, α=0.95, BCa, seed=42 — all match common conventions. Terminology nit: Gemini used outdated "seed=" instead of v0.50 "rng=". | — |

### Refuted (3) — recorded for audit trail; NO fix shipped

| ID | Severity (as auditor marked it) | Finding (rejected) | Rationale |
|----|---------|---------|-----------|
| R9-RG2 | v1.0 blocker (Gemini) | "DeprecationWarning without sunset date is structurally wrong; v1.0 blocker." | REFUTED: re-litigates Q3-locked v0.51 design. `harness.py:1413-1414` shows the deliberate implementation; lines 1347-1349 + 1368 document "warning persists past v1.0" as chosen contract. Tested by `test_evaluate_folded_multi_seed_without_reseed_emits_deprecation_warning`. Meta-pattern: Gemini escalates without engaging with locked reasoning (same as R8 V1+V2). |
| R9-RG3 | v1.0 blocker (Gemini) | "iter_folds capping silently is wrong; should raise." | REFUTED: re-litigates R8-C2 design lock. `splits.py:325-339` shows the deliberate cap-with-UserWarning implementation; warn-vs-raise IS a judgment call; the v0.51 decision was permissive-by-design (caller can opt into `warnings.filterwarnings('error', category=UserWarning)` for strictness). |
| R9-RG6 | validation (Gemini) | "Regression tests apply strict invariant assertions, not weak proxies." | REFUTED — directly contradicted by RC3 which I independently confirmed. `tests/test_harness_folded.py:134-164` is precisely the weak-proxy pattern Gemini claims doesn't exist. Gemini didn't read the cited tests; textbook over-confident validation. |
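
The warn-versus-raise trade-off recorded for R9-RG3 can be illustrated with a minimal sketch. `capped_fold_count` is a hypothetical stand-in for a cap-with-warning function, not the toolkit's `iter_folds`; only the standard-library `warnings` calls are real APIs.

```python
import warnings


def capped_fold_count(requested: int, available: int) -> int:
    """Hypothetical cap-with-warning helper: cap and warn instead of raising."""
    if requested > available:
        warnings.warn(
            f"requested {requested} folds but only {available} available; capping",
            UserWarning,
            stacklevel=2,
        )
        return available
    return requested


# Permissive default: the call succeeds and a UserWarning is recorded.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    permissive_result = capped_fold_count(10, 4)
permissive_warned = any(issubclass(w.category, UserWarning) for w in caught)

# Strict opt-in: promote UserWarning to an exception inside this block only.
strict_raised = False
with warnings.catch_warnings():
    warnings.simplefilter("error", category=UserWarning)
    try:
        capped_fold_count(10, 4)
    except UserWarning:
        strict_raised = True
```

Note that the category has to be passed by keyword to `warnings.filterwarnings`, because its second positional parameter is the message pattern.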

### Partial (1)

| ID | Severity | Finding | Disposition |
|----|----------|---------|-------------|
| R9-RG1 | accept-with-rationale (Gemini) | `np.inf` sentinel from `recall_at_fpr` risks silent downstream failure. | PARTIAL: internally NO callers (grep across `src/` returned only function def + docstring mentions + an unrelated local var in `losses.py:156`). Sentinel risk is real only at user-code boundary; mitigated by Q1-locked design + docstring's "caller may filter via np.isinf" guidance. Accept-with-rationale at v1.0. |
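
As a rough illustration of the user-code boundary risk in R9-RG1, the sketch below shows how an infinite sentinel poisons a naive aggregate and how filtering removes it. The values are made up for the example, and stdlib `math.isinf` stands in for the `np.isinf` guidance so the snippet has no dependencies.

```python
import math


def drop_sentinels(values: list[float]) -> list[float]:
    """Remove +/-inf sentinel entries before aggregating."""
    return [v for v in values if not math.isinf(v)]


# Illustrative values only; not output from the toolkit.
results = [0.42, math.inf, 0.37]

naive_mean = sum(results) / len(results)  # the sentinel makes this infinite
kept = drop_sentinels(results)
filtered_mean = sum(kept) / len(kept)     # mean over the finite entries only
```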

### Third-audit findings (Claude — modules neither auditor cited) — 3 substantive + 3 minor

| ID | Severity | Finding | Disposition | Commit / Defer-to |
|----|----------|---------|-------------|--------------------|
| R9-F-sweep-1 | **blocker for v1.0** (Claude escalation) | `_sweep.py:_validate_scorer_output()` accepts NaN/inf scorer outputs without validation; R7-C "no silent failures" invariant violation in a module Codex R7 designed but didn't audit for finiteness. `stacking.py` validates non-finite scores; `_sweep.py` did not. | Fixed in PR. NaN/inf check added to R7-C boundary; test bundled. | New commit on `release/v0.51.0` (this PR) |
| R9-F-bootstrap-1 | fix-recommended | scipy's BCa returns degenerate CIs (`ci_low == ci_high == point` or NaN bounds) on small n + ceiling/floor metrics; scipy emits DegenerateDataWarning but doesn't raise; pre-v0.51 R8-C4(b) RNG bug may have masked this. | Fixed in PR. UserWarning added at the bootstrap_ci boundary when BCa degenerates; test bundled. | New commit on `release/v0.51.0` (this PR) |
| R9-F-bootstrap-2 | minor | `mde_from_ci` width check `if width <= 0` doesn't catch NaN width (NaN <= 0 is False in IEEE float); silent NaN MDE if BCa returns NaN bounds. | Bundled into the F-bootstrap-1 commit. `or not np.isfinite(width)` guard added. | New commit on `release/v0.51.0` (this PR) |
| R9-F-metrics-1 | minor (Claude) | `metrics.py:1305` Brier-decomposition docstring claims "≈ 1e-9" precision but actual error ~3e-3 dominated by binning. Self-contradictory. | Deferred to v1.0.1 docs polish. | v1.0.1 |
| R9-F-metrics-3 | fix-recommended (Claude) | `expected_calibration_error(y=[0,0,1,1], y_score=[0.5,0.5,0.5,0.5], n_bins=10)` returns 0.0 — technically correct per formula but semantically misleading (uninformative model looks perfectly calibrated). Equal-mass binning correctly rejects n<n_bins; equal-width has no guard. | Deferred to v1.0.1; either guard against all-same-score input OR explicitly document. | v1.0.1 |
| R9-F-metrics-4 | minor (Claude) | `brier_score` single-class docstring ambiguous about whether single-class is supported; implementation correctly handles it but wording is "implementation detail" not contract statement. | Deferred to v1.0.1 docs polish. | v1.0.1 |
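
The two finiteness fixes above (F-sweep-1 and F-bootstrap-2) follow the same pattern, sketched here under stated assumptions: `validate_scores` and `checked_width` are hypothetical helpers rather than the toolkit's functions, and stdlib `math.isfinite` stands in for `np.isfinite`.

```python
import math


def validate_scores(scores: list[float]) -> list[float]:
    """Reject NaN/inf scorer output at the boundary instead of passing it on."""
    bad = [i for i, s in enumerate(scores) if not math.isfinite(s)]
    if bad:
        raise ValueError(f"scorer output contains NaN or inf at indices {bad}")
    return scores


def checked_width(ci_low: float, ci_high: float) -> float:
    """Return the CI width, rejecting non-positive and non-finite widths."""
    width = ci_high - ci_low
    # `width <= 0` alone evaluates to False for NaN, so NaN would slip through.
    if width <= 0 or not math.isfinite(width):
        raise ValueError(f"CI width must be positive and finite, got {width!r}")
    return width


# Demonstrates the gap: the naive comparison does not flag a NaN width.
nan_passes_naive_check = not (float("nan") <= 0)
```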

### Round 9 ship status

- **6 confirmed source-report findings**: 1 fixed in PR (RC1 → `4c43771`),
  3 deferred to v1.0.1 (RC2, RC3, RC4), 2 validation-confirmations
  (RG4, RG5; no action).
- **3 refuted source-report findings** (RG2, RG3, RG6): recorded above for
  audit trail; no shipped fix.
- **1 partial** (RG1): accept-with-rationale at v1.0.
- **2 third-audit fixes shipped in this PR**: F-sweep-1 (CANDIDATE BLOCKER
  closed); F-bootstrap-1 + F-bootstrap-2 bundled (degeneracy warning +
  NaN guard).
- **3 third-audit items deferred to v1.0.1**: F-metrics-1, F-metrics-3,
  F-metrics-4. RC2 (SimilarityStrategy contract resolution) is a Codex
  source-report item and is counted among the confirmed findings.
- **Round 9 STOP-GATE status**: **CLOSED with mitigations** per the
  Path 3 lock. v1.0 can tag from this PR post-CI-green.

### Multi-LLM audit-machinery calibration notes

- **Codex R9 quality**: 4/4 findings hold; RC1 over-cited (3 sites named
  but only 1 had real broken links); no padding finding this round.
  **Gap remains**: explicit skip of the 10 untouched modules + 4
  cross-cutting hunts. F-sweep-1 (the candidate blocker) lives in
  precisely that gap. Future-round briefings should make the
  untouched-modules + hunts mandatory.
- **Gemini R9 quality**: 2/6 validations honest (RG4, RG5); 3/6 REFUTED
  (RG2, RG3, RG6); 1/6 partial (RG1). RG6 is the new V1/V2 pattern: an
  uncited positive validation on tests Gemini didn't read, directly
  contradicted by Codex's RC3 + my verification. RG2 + RG3 escalate
  locked design without engaging with the recorded rationale.
- **Trust calibration**: Codex's probe-backed findings → high trust;
  Codex's untouched-module gaps → fill with Claude third-audit. Gemini's
  validations → unreliable; Gemini's design dissent → "raises the
  question" not "v1.0 blocker."

---

## Round 10 (2026-05-25) — v1.0 pre-tag micro-audit

Scoped Codex + Gemini micro-audit on the `edadddc` R9 follow-on commit
only (full `~/Claude/audit-templates/audit-prompt.md` template with a
scope-override preamble constraining attention to ~126 LOC src + ~110
LOC tests in `_sweep.py` + `bootstrap.py`). Dispatched per the Round
10 locking decision: targeted micro-audit, not full multi-LLM gate
re-run. **Codex** returned a 311-line report with 3 substantive
findings + per-axis verdicts + 4-test probe verification (all 4
passed). **Gemini** returned an 86-line report with 2 findings (1
self-recommended-status-quo + 1 fix-recommended on metrics.py
finiteness — direct contradiction with Codex's "Completeness"
verdict). Claude verification re-read every cited line. Reports
archived locally at `codex-microaudit-edadddc-report.md` +
`gemini-microaudit-edadddc-report.md` (gitignored).

### Confirmed (3) — disposition mapped

| ID | Severity | Source | Finding | Disposition | Commit |
|----|----------|--------|---------|-------------|--------|
| R10-F1 | fix-recommended | Codex | `_sweep.py:298-299` error message says "finite floats in [0, 1]" but check is `np.isfinite` only — no range enforcement. Cross-ref `protocols.py:29-51` Scorer Protocol docstring also lacked explicit `[0, 1]` contract statement. | Fixed in PR (Codex Option C): extend Protocol docstring to document `[0, 1]` calibrated-probability semantics + reword sweep runtime message to drop `[0, 1]` (boundary still doesn't enforce range; enforcement deferred to a future minor once consumer usage patterns clarify). | R10 follow-on commit on `release/v0.51.0` |
| R10-F2 | fix-recommended | Codex | `tests/test_bootstrap_unit.py:337-343` BCa degeneracy test uses `if ci.ci_low == ci.ci_high == ci.point_estimate:` — silently no-ops when scipy returns NaN bounds (NaN==NaN is False in IEEE float). Test passes WITHOUT proving the warning fires. Codex probe-verified the no-op path on the current scipy fixture. | Fixed in PR (Codex Option A): mirror the production predicate — `(not np.isfinite(low)) or (not np.isfinite(high)) or (low == high == point)`. Assertion block runs whenever ANY degeneracy mode fires. | R10 follow-on commit on `release/v0.51.0` |
| R10-F3 | minor-observation | Codex | `bootstrap.py:1099-1105` Raises docstring says "non-positive width" but `:1132-1140` implementation rejects on `width <= 0 or not np.isfinite(width)`. Code stricter than docs. | Fixed in PR (Codex Option A + brief Option B context): update Raises text to "non-positive or non-finite width" + 4-line note explaining scipy BCa NaN-bound motivation. | R10 follow-on commit on `release/v0.51.0` |
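
The R10-F2 failure mode is easy to reproduce in isolation. The sketch below contrasts the exact-collapse comparison with the mirrored predicate quoted in the disposition; the function names are illustrative, and stdlib `math.isfinite` stands in for `np.isfinite`.

```python
import math


def naive_is_degenerate(low: float, high: float, point: float) -> bool:
    """Exact-collapse check only; always False when any bound is NaN."""
    return low == high == point


def is_degenerate(low: float, high: float, point: float) -> bool:
    """Non-finite bounds, or bounds collapsed onto the point estimate."""
    return (
        (not math.isfinite(low))
        or (not math.isfinite(high))
        or (low == high == point)
    )


nan = float("nan")
missed_by_naive = not naive_is_degenerate(nan, nan, 1.0)  # NaN == NaN is False
caught_by_mirror = is_degenerate(nan, nan, 1.0)
```

A test guarded by the naive predicate skips its assertion block on NaN bounds, which is why it can pass without proving that the warning fires.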

### Partial (1) — accept-as-design

| ID | Severity | Source | Finding | Disposition |
|----|----------|--------|---------|-------------|
| R10-RG1 | minor-observation | Gemini | `bootstrap.py:376-386` BCa degeneracy check uses output proxy (`ci_low == ci_high == point` + non-finite bounds) rather than catching scipy's internal `DegenerateDataWarning`. May miss edge-case degeneracies where jackknife acceleration is undefined but bounds don't exactly collapse. | Accept-as-design. **Gemini's own recommendation was Option B (status quo)**: "Deterministic, fast, and covers the primary 'small n + ceiling metric' failure mode perfectly." Output-proxy approach avoids `warnings.catch_warnings` overhead on a hot path. Catching internal `DegenerateDataWarning` is a potential v1.1 enhancement if real-world degeneracy modes surface that the output proxy misses. |

### Refuted (1) — Pattern-1 calibration record

| ID | Severity (as Gemini marked it) | Source | Finding | Rationale |
|----|----------|--------|---------|-----------|
| R10-RG2 | fix-recommended | Gemini | "`metrics.py` functions like `pr_auc` and `roc_auc` lack a similar `np.isfinite` boundary check." | **REFUTED**. `metrics.py:_validate_inputs` (line 1846) explicitly contains `if not np.isfinite(y_score_arr).all(): raise ValueError("y_score contains NaN or inf")`. Called at 20+ public-function sites (lines 366, 441, 479, 552, 650, 802, 872, 942, 1036, 1157, 1261, 1337, 1429, plus more). The check predates the audit chain (the comment "harmonizes with `score_distribution_summary`'s own guard" suggests pre-v0.46 vintage). Gemini did not read the cited code before making the claim — textbook **Pattern-1 (validation-without-reading)** violation, despite the micro-audit prompt's `Calibrated-confidence discipline` section explicitly calling out this exact pattern. The calibration record is preserved here. |

### Multi-LLM audit-machinery calibration (R10)

- **Codex R10 quality**: 3/3 substantive findings hold; probe-backed; cited file:line on every observation; honest "What I didn't look at" calibration. Same depth and rigor as R8 + R9.
- **Gemini R10 quality**: 1/2 findings honest (RG1, with self-recommendation = status quo; effectively an open question framed as a finding); 1/2 REFUTED (RG2, Pattern-1 violation). **The prompt's explicit Pattern-1 discipline section did NOT change the outcome.** Pattern-1 is now confirmed in four instances across three consecutive rounds: R8 V1+V2, R9 RG6, and R10 RG2. Conclusion: Gemini's training-trace defaults toward positive validation without reading; explicit prompt-level discipline is necessary but not sufficient to correct it. Practical implication: future audits should treat Gemini validations as "raises the question" and require Claude (or Codex) read-back before accepting any positive validation.
- **Round 10 STOP-GATE status**: **CLOSED with mitigations** — 3 Codex confirmed findings fixed in this RC; 1 Gemini accept-as-design; 1 Gemini refuted. v1.0 can tag from the R10 follow-on commit post-CI-green.

---

## Round 11 (2026-05-26) — consumer-feedback-driven validator adoption

**Not a multi-LLM gate-style audit.** Round 11 is the first
consumer-feedback-driven round: consumer-side audit findings → upstream
validator shipped → consumer adopts. Multi-LLM cross-review (Rounds
5–10) was the v1.0 stability-contract activation discipline; Round 11
is the v1.x ongoing consumer-feedback loop. Future Round 12+ would
resume multi-LLM dispatch only if v2.0 work or a major audit cycle
re-opens. This is flagged so readers don't confuse R11 with the
heavyweight gate-style rounds.

### Motivating findings (consumer audit origin)

- **V1.3.1 ADR-080 audit-fix** (2026-05-22, consumer
  `prompt-injection-detection-prototype`): `WRITEUP_NARRATIVE.md:38`
  said "TF-IDF + LR reaches 0.974 AUPRC" but canonical TF-IDF direct
  val AUPRC = 0.971 (0.974 is LoRA's value). The existing
  `audit_numbers.py` validates VALUES against source data but not
  BINDINGS — both 0.971 AND 0.974 exist in the expected set; the bug
  is the wrong pairing.
- **V1.3.2 P1-2 Part 2** (2026-05-23, consumer same repo):
  `docs/REPRODUCIBILITY.md:76` cited "(locked at Phase 0-07 via
  ADR-029)" — ADR-029 is the test-marker-strategy ADR, not the
  reproducibility tier-lock ADR (which is ADR-034). The citation is off
  by five ADR numbers; lychee resolves it (ADR-029 exists) and
  `audit_numbers.py` validates values, but neither catches semantic
  mis-citation.
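
The values-versus-bindings gap from the first finding can be shown with a toy check. This is a minimal sketch, not the consumer's `audit_numbers.py` or the toolkit's validator: the `CANONICAL` table, both function names, and the regex are assumptions made for illustration.

```python
import re

# Hypothetical canonical table: which value belongs to which detector.
CANONICAL = {"TF-IDF + LR": 0.971, "LoRA": 0.974}

sentence = "TF-IDF + LR reaches 0.974 AUPRC"


def value_check(text: str) -> bool:
    """Value-level audit: every number in the text exists in the canonical set."""
    numbers = [float(m) for m in re.findall(r"\d+\.\d+", text)]
    return all(n in CANONICAL.values() for n in numbers)


def binding_check(text: str) -> list[str]:
    """Binding-level audit: a number beside a detector must be that detector's value."""
    problems = []
    for detector, expected in CANONICAL.items():
        if detector in text:
            for m in re.findall(r"\d+\.\d+", text):
                if float(m) != expected:
                    problems.append(f"{detector}: found {m}, expected {expected}")
    return problems
```

The sentence passes the value-level check because 0.974 is a legitimate number somewhere in the table; only the binding-level check notices that it is attached to the wrong detector.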

### Upstream response timeline

| Time (UTC) | Event |
|---|---|
| 2026-05-24 | Consumer files #71 (value_bindings), #72 (concept_drift), #73 (citation_alignment) at v1.3.3 |
| 2026-05-24 (later) | PR #74 opens with audit subpackage skeleton + `validate_citations` |
| 2026-05-25 17:01Z | v1.0.0 tag (stability contract activates per ADR 0003) |
| 2026-05-25 17:08Z | v1.0 snapshot regen hotfix (`eb94d26`) |
| 2026-05-25 22:51Z | PR #74 merged (`audit_citation_alignment` flat-module per ADR 0001; restructured from initial subpackage shape) |
| 2026-05-25 23:48Z | **v1.0.1 ships** — `audit_citation_alignment` + RC4 docs polish |
| 2026-05-26 01:02Z | v1.0.2 ships — #76 cleanup batch (RC2 SimilarityStrategy promoted to 10th strict Tier-2; RC3 test row-content hardening; F-metrics docstring polish) |
| 2026-05-26 01:48Z | **Consumer PR #10 merged** at v1.3.7 — consumer-side `audit_citation_alignment` adoption complete |
| 2026-05-26 02:35Z | **v1.0.3 ships** — `audit_value_bindings` + closes #71 |

End-to-end cycle: V1.3.2 P1-2 origin (2026-05-23) → consumer
adoption (2026-05-26 01:48Z) = ~3 days. Upstream-side v1.0.0 → v1.0.3
(audit-validator family of 2): ~10h compressed cycle. Consumer's
v1.3.7 PR opened against `audit_citation_alignment` (the v1.0.1
deliverable); subsequent v1.0.3 `audit_value_bindings` adoption will
happen on consumer's own cadence.

### Audit-validator family status

| Validator | Module | Released | Issue | Status |
|---|---|---|---|---|
| Citation alignment | `audit_citation_alignment` | v1.0.1 (`067d4b0`) | #73 | ✅ shipped + consumer-adopted (consumer PR #10 v1.3.7) |
| Value bindings | `audit_value_bindings` | v1.0.3 (`775f3da`) | #71 | ✅ shipped; consumer adoption pending |
| Concept drift (embedding-clustering) | `audit_sister_doc_concept_drift` (proposed) | future | #72 | open; requires MiniLM (already vendored for leakage) + clustering design |

### Pre-tag + post-v1.0.3 dogfood summary

The validators were dogfooded against eval-toolkit's own markdown
surface to prove they work in production before consumer adoption:

- **v1.0.1 pre-tag dogfood** (`validate_citations`): 95 files
  (README + docs/source/**/*.md ex-ADRs ex-Sphinx-generated); 0
  misalignments. Helper `_build_adr_subjects_from_repo()` exercised.
- **Post-v1.0.3 dogfood** (re-run + synthetic `validate_reader_value_bindings`):
  - Part 1 (citation_alignment expanded): 96 files; 0 misalignments
    (sanity preserved after audit_value_bindings.md + R10 + R11
    additions to docs/source).
  - Part 2 (value_bindings synthetic): bindings derived from Round 8
    RC4-reconciled tally (`("Round 8", "confirmed_count") -> 13` etc).
    Validator processed `audit_findings.md` + `migration/v0.51.md` +
    `CHANGELOG.md` without crashing; surfaced 216 pattern-matches +
    82 matches. **The high false-positive count is expected**: the
    synthetic bindings dict has very loose detector ("Round 8") and
    metric (numeric-count) patterns, so any number near a "Round 8"
    mention triggers. These are **NOT real bugs in eval-toolkit's docs**;
    the validator is simply exercising the regex+window pipeline on
    production-scale markdown.

The dogfood script is preserved at `.scratch/dogfood_v1_0_x.py`
(gitignored) for future re-runs.

### Round 11 outcome

- ✅ Action 1 (consumer adoption): consumer PR #10 merged 2026-05-26 01:48Z. Consumer-side dogfood + adoption complete for `validate_citations`.
- ✅ Action 2 (this ledger entry): R11 section recorded.
- ✅ Action 3 (dogfood expansion): validators exercised on eval-toolkit's own surface; no upstream-side findings.

GH issue #77 closed with state_reason=completed.

### Multi-LLM audit cadence after R11

The Gate 3 audit chain (R5–R10) closed at v1.0 per ADR 0003. R11+ is
the post-v1.0 consumer-feedback loop and uses NO multi-LLM cross-review
by default. Multi-LLM dispatch resumes only if:

1. A v2.0 design cycle opens (would require a Gate 3-equivalent
   re-audit per ADR 0003 §"v2.0 re-audit triggers").
2. A consumer files a major-severity finding that single-LLM
   verification (Claude only) leaves uncertain — in which case Codex
   + Gemini dispatch via the `audit-prompt.md` template re-engages
   for that specific question.

Otherwise R11+ is light-touch documentation of the consumer-feedback
→ upstream-validator → consumer-adopt cycle.

---

## Round 12 (2026-05-26) — schema-extensibility + scope-correctness lesson from #80

**Not a multi-LLM gate-style audit.** Round 12 is the second
consumer-feedback-driven round, immediately following R11. The R11
audit-validator family (2 of 3 shipped per the R11 status table) revealed a deeper structural
gap in `audit_value_bindings` that the v1.0.3 implementation could
not address: the 2-tuple `(detector, metric)` canonical-identity
schema collapsed across slices, producing ~95 false positives on
the consumer's writeup. The fix (v1.1.0) is bigger than the issue's
literal ask — it surfaces two architectural rules that now govern
all audit validators (pending ADR 0005).

### Motivating finding (consumer audit origin)

- **Consumer's v1.3.9 run** (2026-05-26, prompt-injection-detection-submission):
  `scripts/audit_value_bindings.py` against own writeup produced
  **96 warnings, ~95 false positives**. The one real bug
  (`WRITEUP_PAPER.md:545` mis-citing 0.971 inside a misleading
  paragraph context) was correctly flagged but hid in the noise.
  Consumer caught it via parallel manual audit, **not** via this
  validator. Filed [#80](https://github.com/brandon-behring/eval-toolkit/issues/80)
  at 2026-05-26T18:15:45Z proposing a 3-tuple
  `(detector, metric, slice)` schema extension.

### Upstream response timeline

| Time (UTC) | Event |
|---|---|
| 2026-05-26T18:15Z | Consumer files #80 (BINDINGS slice-axis schema) |
| 2026-05-26T18:30Z | `/exploring-options` Round 1: P0 path for v1.0.4 PyPI gap. Decision: refined bundle (rerun + workflow_dispatch + verify-receipt). |
| 2026-05-26T19:00Z | `/exploring-options` Round 2: #80 migration shape. Decision: **Option 4 — `BindingKey` structured key + multi-shape adapter** (rejected Option 1 `(d, m, s)` in-place tuple because positional tuples lock in the recur-every-N-months schema-event pattern). |
| 2026-05-26T19:24Z | **v1.0.5 ships** — infrastructure-only release (publish workflow hardening; dress rehearsal for the new Verify PyPI receipt step). |
| 2026-05-26T19:30Z | `/exploring-options` Round 3: Step 1 release strategy + verify specs. Decision: v1.0.5 as dress rehearsal; conservative 12×30s PyPI poll. |
| 2026-05-26T19:45Z | `/exploring-options` Round 4: ADR 0005 scope + deprecation policy. Decision: medium scope (audit validators only); indefinite acceptance of legacy 2-tuple BINDINGS through v1.x. |
| 2026-05-26T20:00Z | First dogfood of v1.1.0 slice-axis fix: only 22% noise reduction (95 → 74). User pushed back: "what is the right long-term solution?" Triggered the deeper rethink. |
| 2026-05-26T20:10Z | Identified the **second architectural layer**: scope correctness. ~80% of v1.0.5 residual noise was content-type confusion (CI brackets, table cells, code blocks), not slice-axis confusion. Added `scope='narrative'` filter. |
| 2026-05-26T20:11Z | **v1.1.0 ships** — `BindingKey` + slice-aware matching + `scope='narrative'` (closes #80). |

End-to-end cycle: #80 filing (18:15Z) → v1.1.0 ship (20:11Z) =
**~2 hours compressed cycle**. Four `/exploring-options` rounds +
one user-driven architectural redirect ("ultrathink on what the
right long-term solution is") produced a coherent two-layer fix
instead of the issue's narrower literal ask.

### The two-layer correctness lesson (pending ADR 0005)

Audit validators have two correctness layers, not one:

1. **Identity correctness** — canonical measurements have
   *structured identity* (frozen dataclass with named fields),
   not positional tuples. Future identity axes (split, ci_kind,
   source_ref, ...) added as defaulted fields without breaking
   the dict-key schema. Positional tuples lock the validator
   into a recur-every-N-months schema-event pattern.
2. **Scope correctness** — the validator should only scan content
   *plausibly a binding claim*. Narrative prose sentences are.
   Markdown table cells aren't (they're structured data,
   audited differently). Bracketed expressions (`[CI 0.286,
   0.301]`) aren't (inline expressions, not point estimates).
   Code blocks aren't.

This mirrors lint-design conventions from `ruff`/`mypy`/`bandit`
(scope predicates like `# noqa`/`# nosec`/`# type: ignore` are
first-class, not optional).
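
A scope predicate of the kind described in layer 2 might look roughly like the following. This is a simplified sketch, not the toolkit's `scope='narrative'` implementation: `narrative_lines` is a hypothetical name, table rows are detected only by a leading pipe, and the bracket rule would also strip Markdown link text.

```python
import re

FENCE = "`" * 3  # built programmatically so the example nests cleanly in Markdown


def narrative_lines(markdown: str) -> list[str]:
    """Keep prose lines only; drop fenced code, table rows, and bracketed spans."""
    kept = []
    in_fence = False
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith(FENCE):
            in_fence = not in_fence  # toggle on opening and closing fences
            continue
        if in_fence or not stripped:
            continue
        if stripped.startswith("|"):
            continue  # table row: structured data, audited differently
        prose = re.sub(r"\[[^\]]*\]", "", stripped)  # drop bracketed expressions
        kept.append(" ".join(prose.split()))
    return kept


sample = "\n".join(
    [
        "TF-IDF reaches 0.971 AUPRC [CI 0.960, 0.980].",
        "| LoRA | 0.974 |",
        FENCE + "python",
        "score = 0.5",
        FENCE,
    ]
)
scanned = narrative_lines(sample)
```

Only the first line of `sample` survives, and its bracketed interval is removed, so a downstream matcher would see one candidate value instead of five.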

### Dogfood evidence

| Configuration | Warnings on consumer HEAD | Reduction vs v1.0.5 baseline |
|---|---|---|
| v1.0.5 (legacy 2-tuple) | 95 | — (baseline) |
| v1.1.0 BindingKey + slice-fix only (scope='all', default `slice_window_chars=120`) | 74 | -22% |
| v1.1.0 BindingKey + slice-fix + `scope='narrative'` (recommended) | **23** | **-76%** |
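
The reduction column is plain arithmetic against the 95-warning baseline; a small helper (hypothetical, for checking the table only) reproduces it:

```python
def percent_reduction(baseline: int, remaining: int) -> int:
    """Whole-percent drop from `baseline` to `remaining`."""
    return round(100 * (baseline - remaining) / baseline)


slice_fix_only = percent_reduction(95, 74)   # 21 of 95 warnings removed
with_narrative = percent_reduction(95, 23)   # 72 of 95 warnings removed
```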

The first instinct — "just add a slice axis like the issue asks"
— would have left the validator practically unusable on dense
academic prose, requiring a separate v1.2.0 cleanup release. The
deeper rethink (triggered by the user's redirect) delivered both
layers in a single coherent v1.1.0.

Residual 23 warnings are pre-existing positional-heuristic
limitations (sentence-boundary unawareness, multi-detector list
parsing on dense prose) not addressable without parser-level
work; deferred to future v1.2.0+ as a narrower issue class.

### Round 12 outcome

- ✅ **v1.1.0 shipped** (commit `911565b`, PyPI live).
- ✅ **#80 auto-closed** by `closes #80` directive on the v1.1.0 tag.
- ✅ **0 open issues** on the eval-toolkit repo post-#80.
- ✅ **Architecture lesson codified** in pending
  [ADR 0005](adr/0005-structured-keys-for-audit-validators.md):
  the identity + scope two-layer rule applies to future audit
  validators.
- ✅ **Memory entry**: `feedback_validator_identity_plus_scope`
  captures the "issue body says X is the problem; investigate the
  dominant noise source before agreeing" discipline.

### Consumer adoption + HARD-gate viability

Consumer (`prompt-injection-detection-submission`) currently pins
`eval-toolkit>=1.0.3,<2`. Adoption path for v1.1.0:

- Smallest diff: replace 2-tuple `BINDINGS` literal with 3-tuple
  keys (issue body's proposal; works directly).
- Recommended: migrate to `BindingKey(detector=..., metric=...,
  slice=...)` for forward-extensibility.
- Add `scope="narrative"` to the validator call.
- Re-run; expect ~76% noise reduction.
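
The two key-migration options in the list above could coexist through a small normalizing adapter. The sketch below is an assumption-laden illustration: only the `detector`, `metric`, and `slice` field names come from this ledger; the `split` field, the `normalize_key` helper, and the sample values are hypothetical, and the toolkit's real `BindingKey` may differ.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class BindingKey:
    """Sketch of a structured key; frozen so instances are hashable dict keys."""

    detector: str
    metric: str
    slice: Optional[str] = None
    split: Optional[str] = None  # hypothetical future axis; defaulted, so old keys still work


def normalize_key(raw: Union[BindingKey, tuple]) -> BindingKey:
    """Accept a BindingKey, a legacy 2-tuple, or a 3-tuple and return a BindingKey."""
    if isinstance(raw, BindingKey):
        return raw
    if isinstance(raw, tuple) and len(raw) in (2, 3):
        return BindingKey(*raw)
    raise TypeError(f"unsupported binding key shape: {raw!r}")


# Mixed-shape input, as a consumer might have mid-migration (illustrative values).
legacy_bindings = {
    ("TF-IDF", "AUPRC"): 0.971,
    ("LoRA", "AUPRC", "val"): 0.974,
    BindingKey(detector="frozen probe", metric="AUROC"): 0.515,
}
bindings = {normalize_key(k): v for k, v in legacy_bindings.items()}
```

Because new axes arrive as defaulted fields, adding one does not invalidate keys that were written before it existed, which is the property positional tuples lack.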

HARD-gate promotion at consumer's v1.3.10+ becomes credible at the
~80% reduction level. Remaining residual false positives can be
suppressed via consumer-side filtering (excluding lines containing
"random floor" or "versus") or accepted as known low-frequency
noise pending v1.2.0+ pairing-rule improvements.

### Multi-LLM audit cadence after R12

Same as R11: no multi-LLM cross-review for consumer-feedback
rounds; reserved for v2.0 design cycles or major-severity findings
that single-LLM verification cannot resolve. The R12 fix was
informed by the user-driven `/exploring-options` discipline
(4 rounds + a structural redirect), which substitutes for
multi-LLM dispatch for this class of architectural decision.

---

## Round 13 (2026-05-26) — `audit_value_bindings` v1.2.0 context-aware noise reduction

**Not a multi-LLM gate-style audit.** Third consumer-feedback-driven
round, immediately following R12. R12 closed the slice-axis
schema-extensibility gap (BindingKey + scope='narrative' content-type
filter) and reduced the consumer's noise floor from 96 → 36 warnings
(62%). R13 addresses the residual 36 — the positional-heuristic
limitations ADR 0005 named as "Future work (deferred)" — via four
context-aware extensions to `scope='narrative'` shipped as v1.2.0.

### Motivating evidence (consumer-side, post-v1.3.11)

Consumer (`prompt-injection-detection-submission@v1.3.11`) ran the
v1.1.0 `audit_value_bindings` against their writeup and produced 36
residual false positives. Consumer-side categorization in
`decisions/upstream_issues.md:92`:

| Category | Count | Discriminating prose pattern |
|---|---|---|
| Random-floor / sub-clause mentions | ~10 | "random floor of 0.374", "baseline of 0.374" |
| Delta with sign / keyword | ~9 | "-0.071 AUPRC", "drops -0.132 below" |
| Multi-detector "vs"/"versus" enumeration | ~8 | "AUPRC 0.556 vs 0.519" (same binding double-flagged) |
| Sentence-boundary cross-pairing | ~5 | "X scored 0.291. The floor is 0.374" |
| Genuinely ambiguous (cross-detector list constructions) | ~4 | "0.293 versus 0.364 for the frozen probe and 0.291 for TF-IDF" |

The first four categories share a structural property: they're
narrative-prose context cues that the v1.1.0 positional heuristic
can't read. ADR 0005's "Future work (deferred)" section named them
explicitly. The fifth category (cross-detector list grammar) is a
deeper parser-level problem.

### Upstream design — `/exploring-options` 4 rounds + ultrathink redirect

| Time (UTC) | Event |
|---|---|
| ~20:30Z | `/exploring-options` Round 1: which categories to address in v1.2.0? Decision: **Full T1+T2+T3+T4** over the Explore agent's recommended T1+T2-only. T3 and T4 are cheap (~70 LOC combined) and address ADR 0005-named deferred work. |
| ~20:45Z | `/exploring-options` Round 2: API placement. Decision: **all four filters bundled under `scope='narrative'`**; backward compat for `scope='all'` preserved exactly. |
| ~21:00Z | `/exploring-options` Round 3: sentence-boundary detection rigor. Decision: **paragraph-aware abbreviation guard** (vs./e.g./i.e./etc./cf./fig./eq./pp./viz./ca. excluded; single `\n` soft, `\n\n` hard; decimal + letter-dot-letter patterns guarded). Rejected naive `α`/`β` (over-splits multi-line academic prose); rejected `γ` library-based (nltk/spacy too heavy). |
| ~21:10Z | `/exploring-options` Round 4: keyword list configurability. Decision: **hardcoded module-level `frozenset` constants** (`_DELTA_KEYWORDS`, `_FLOOR_KEYWORDS`). No new public kwargs. |
| ~21:20Z | **v1.2.0 ships** — commit `40b3741`, PyPI live, GH release published, `Verify PyPI receipt` step ✓. End-to-end from R12 closure to v1.2.0 ship: ~1 hour. |
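
The Round 3 decision (paragraph-aware splitting with an abbreviation guard) can be approximated in a few lines. This is a sketch under assumptions, not the shipped implementation: `split_sentences` and `ABBREVIATIONS` are illustrative names, and only the abbreviation list and the soft/hard newline rule are taken from the table above.

```python
import re

# Abbreviations named in the Round 3 decision, stored without the trailing dot.
ABBREVIATIONS = frozenset(
    {"vs", "e.g", "i.e", "etc", "cf", "fig", "eq", "pp", "viz", "ca"}
)


def split_sentences(text: str) -> list[str]:
    """Split prose into sentences with paragraph-aware, abbreviation-guarded rules."""
    sentences: list[str] = []
    for paragraph in re.split(r"\n\s*\n", text):  # blank line: hard boundary
        flat = " ".join(paragraph.split())        # single newline: soft, joined
        start = 0
        # A dot only counts when followed by whitespace or end of text, so
        # decimals ("0.291") and inner dots ("e.g") never match.
        for match in re.finditer(r"\.(?=\s|$)", flat):
            tokens = flat[start:match.start()].split()
            last = tokens[-1].lower().lstrip("(") if tokens else ""
            if last in ABBREVIATIONS:
                continue  # "vs." or "e.g." does not end a sentence
            sentences.append(flat[start:match.end()].strip())
            start = match.end()
        tail = flat[start:].strip()
        if tail:
            sentences.append(tail)
    return sentences
```

One known limitation of this simplified guard: a sentence that genuinely ends in "etc." is not split from the one that follows it.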

### Dogfood evidence (compounded across the cycle)

| Release | Configuration | Warnings on consumer HEAD | Reduction vs v1.0.5 |
|---|---|---|---|
| v1.0.5 | Legacy 2-tuple, no scope filter | 95 | — (baseline) |
| v1.1.0 | `BindingKey` + `scope='narrative'` content-type filter | 23 | -76% |
| **v1.2.0** | + T1–T4 context filters | **7** | **-93%** |

The 7 v1.2.0 residuals are all in the "cross-detector list
construction" category — prose where the validator can't infer
that subsequent values belong to OTHER detectors via list
connectives ("and", "for X", "vs"). T3 only deduplicates the SAME
binding within one sentence; cross-detector list inference is a
parser-level problem.

### Tactical refinements during implementation

Three calibration adjustments emerged from running the test suite +
dogfood:

1. **`_FLOOR_KEYWORDS` narrowed** — initial draft included
   "baseline", "prior", "majority"; removed after the existing
   `test_detector_with_no_nearby_value_skipped` failed on prose
   `"The TF-IDF baseline performs well"` (where "baseline" was a
   detector descriptor, not a floor reference). Replacement
   coverage for the legitimate floor cases is provided by T1's
   "below" keyword (e.g., "below the prevalence baseline of 0.374"
   is caught by "below").
2. **T1 keyword window changed to before-only** — initial draft
   used symmetric ±30 chars; the combined-dogfood test failed on
   `"frozen probe's 0.515 (delta -0.132)"` because "delta" (AFTER
   0.515) suppressed 0.515. Before-only window correctly fires
   T1 on `-0.132` (preceded by "delta") and leaves 0.515 alone.
3. **`_DELTA_KEYWORDS` calibrated** — dropped "against"
   (suppressed legitimate "LoRA's AUROC is 0.383 against frozen
   probe's 0.515" matches). Kept "vs"/"versus" (canonical delta
   separator). Excluded "above"/"ahead"/"behind" (too ambiguous).

These calibrations are captured in the keyword-list comments in
`audit_value_bindings.py` so future maintainers see the rationale.
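
The second calibration (symmetric window versus before-only window) can be replayed on the exact prose that failed. The sketch below is illustrative: `DELTA_KEYWORDS` is a subset assembled from keywords mentioned in this section rather than the module's real constant, `suppressed` is a hypothetical helper, and the 30-character window matches the draft value quoted above.

```python
import re

# Illustrative subset; the real `_DELTA_KEYWORDS` contents are not reproduced here.
DELTA_KEYWORDS = frozenset({"delta", "below", "vs", "versus"})
WINDOW_CHARS = 30


def suppressed(text: str, match: re.Match, *, symmetric: bool) -> bool:
    """True if a delta keyword sits inside the window around the matched number."""
    window = text[max(0, match.start() - WINDOW_CHARS):match.start()]
    if symmetric:
        # The initial draft also looked at the characters after the number.
        window += " " + text[match.end():match.end() + WINDOW_CHARS]
    words = set(re.findall(r"[a-z]+", window.lower()))
    return bool(words & DELTA_KEYWORDS)


prose = "frozen probe's 0.515 (delta -0.132)"
numbers = list(re.finditer(r"-?\d+\.\d+", prose))

before_only = {m.group(): suppressed(prose, m, symmetric=False) for m in numbers}
symmetric = {m.group(): suppressed(prose, m, symmetric=True) for m in numbers}
```

With the symmetric window, "delta" after `0.515` wrongly suppresses the point estimate; with the before-only window, only `-0.132` is treated as a delta.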

### Round 13 outcome

- ✅ **v1.2.0 shipped** (commit `40b3741`, PyPI live).
- ✅ **ADR 0005 amended**: "Future work (deferred)" section notes
  the v1.2.0 partial closure; sentence-boundary + 3 related
  context-aware filters resolved; cross-detector list grammar
  remains v1.3.0+ territory.
- ✅ **Consumer adoption path clear**: re-pin
  `eval-toolkit>=1.2.0,<2`; HARD-gate promotion of
  `audit_value_bindings` becomes credible (7 residual < 10
  actionable threshold).
- ✅ **Memory entry updated**: `feedback_validator_identity_plus_scope`
  extended with the v1.2.0 sequel — the same "identity + scope"
  framing accommodates the new context-aware filters as scope
  extensions, vindicating the original two-layer architecture
  decision.

### Multi-LLM audit cadence after R13

Unchanged from R11/R12: no multi-LLM cross-review for
consumer-feedback rounds. The R13 design was informed by the
user-driven `/exploring-options` discipline (4 rounds + tactical
calibration during implementation), substituting for multi-LLM
dispatch for this class of incremental refinement. Future Round 14+
would re-engage multi-LLM dispatch only if a v2.0 design cycle
opens or a major-severity consumer finding requires it.

### Cross-references

- ADR 0005 — "Future work (deferred)" section now reflects v1.2.0
  closure status.
- v1.2.0 CHANGELOG entry — full per-filter detail (T1–T4 keyword
  lists, window calibration, dogfood numbers).
- `tests/test_audit_value_bindings.py` — 36 tests (28 from
  v1.1.0 + 8 new for T1–T4 + sentence-boundary unit test).

---

## Round 14 (2026-05-26) — Layer 3 pairing rules close cross-detector list-grammar (#81)

**Not a multi-LLM gate-style audit.** Fourth consumer-feedback-driven
round in the post-v1.0 cycle (R11 → R12 → R13 → R14). The v1.2.0 cycle
(Round 13) reduced consumer noise from 95 → 4 warnings via the T1–T4
context-aware narrative filters; the residual 4 were all cross-detector
list-grammar / metric-axis confusion cases that the two-layer
identity + scope model couldn't resolve. Consumer filed
[#81](https://github.com/brandon-behring/eval-toolkit/issues/81)
documenting the 3 prose patterns + proposing a `_LIST_CONNECTIVES`
extension; upstream v1.3.0 ships **Layer 3 (pairing rules)** per
[ADR 0006](adr/0006-pairing-rules-for-cross-detector-list-grammar.md)
as the architectural closure.

### Motivating residuals from v1.2.0 dogfood

Consumer's v1.3.12 dogfood (eval-toolkit v1.2.0 adopted; 36 warnings,
narrowed to 4 by their additional SKIP_PATTERNS):

| File:Line | Pattern category | Prose snippet |
|---|---|---|
| `WRITEUP_PAPER.md:304` | "for X" postfix (Pattern A) | `"versus 0.364 [...] for the frozen probe and 0.291 [...] for TF-IDF + LR"` |
| `RESULTS.md:171` (×2) | possessive + metric confusion (Patterns B + D) | `"LoRA's pooled OOD AUROC is 0.383 against frozen probe's 0.515"` |
| `README.md:71` | group subject (Pattern C) | `"0.38 AUROC, ~0.6 drop for the trained detectors"` |

### Upstream design — `/exploring-options` 2 rounds + Pattern D emerged during dogfood

| Time (UTC) | Event |
|---|---|
| ~21:30Z | Consumer files #81 documenting the 3 prose patterns (A/B/C) + proposing `_LIST_CONNECTIVES` design. |
| ~22:00Z | `/exploring-options` Round 1: scope. Decision: A + B + C-suppress (not C-inference) — closes all 4 residuals at ~130 LOC; rejected A+B-only (leaves 1) and A+B+C-inference (~250 LOC, MODERATE-HIGH risk). |
| ~22:10Z | `/exploring-options` Round 2: documentation. Decision: new ADR 0006 codifying Layer 3 (pairing rules) as the third correctness layer alongside ADR 0005's identity + scope. |
| ~22:30Z | Implementation reaches first dogfood: 4 → 2 warnings. **Pattern D emerges**: the residual 2 are metric-axis confusion (`"AUPRC delta suggests: AUROC is 0.383"` — wrong metric picked up by proximity). Added as fourth rule (symmetric to detector-axis pairing). |
| ~22:45Z | Three tactical calibrations: (a) Pattern A intervening-value check now uses v1.1.0's `excluded_ranges` (CI brackets don't count as intervening); (b) Pattern B refactored to "last possessive within 30 chars" instead of requiring an "is" verb; (c) Pattern C uses v1.2.0's `_crosses_sentence_boundary` to prevent paragraph-crossing suppression bleed. |
| ~22:55Z | Override-bypass architecture: Pattern A/B now set `pairing_confirmed_pos` to BYPASS the proximity-based detector pairing when they confirm THIS binding's detector (fixed the bug where override + proximity disagreed). |
| ~23:00Z | **v1.3.0 ships** — commit `4c2742e`. Consumer dogfood: **4 → 0 warnings**. Combined v1.0.5 → v1.3.0: 95 → 0 (-100%). |

End-to-end from #81 filing to v1.3.0 ship: ~1.5 hours.
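
Calibration (a) in the timeline, the Pattern A intervening-value check that ignores bracketed spans, can be sketched as follows. This is an illustrative sketch under stated assumptions: `postfix_detector`, `VALUE_RE`, and `BRACKET_RE` are hypothetical names, and the bracket regex is a simplified stand-in for the validator's exclusion ranges.

```python
import re
from typing import List, Optional

VALUE_RE = re.compile(r"-?\d+\.\d+")
BRACKET_RE = re.compile(r"\[[^\]]*\]")  # simplified stand-in for exclusion ranges


def postfix_detector(text: str, value_end: int,
                     detectors: List[str]) -> Optional[str]:
    """Return the detector named by the nearest "for {detector}" after a value.

    Returns None when no postfix is found or when another numeric value,
    outside any bracketed span, sits between the value and the postfix.
    """
    excluded = [m.span() for m in BRACKET_RE.finditer(text)]
    nearest = None  # (position of "for", detector name)
    for name in detectors:
        pattern = re.compile(r"\bfor (?:the )?" + re.escape(name), re.IGNORECASE)
        m = pattern.search(text, value_end)
        if m and (nearest is None or m.start() < nearest[0]):
            nearest = (m.start(), name)
    if nearest is None:
        return None
    for v in VALUE_RE.finditer(text, value_end, nearest[0]):
        inside_bracket = any(lo <= v.start() < hi for lo, hi in excluded)
        if not inside_bracket:
            return None  # a competing value intervenes, so do not override
    return nearest[1]


prose = "versus 0.364 [...] for the frozen probe and 0.291 [...] for TF-IDF + LR"
detectors = ["frozen probe", "TF-IDF + LR"]
print(postfix_detector(prose, prose.index("0.364") + 5, detectors))  # frozen probe
print(postfix_detector(prose, prose.index("0.291") + 5, detectors))  # TF-IDF + LR
```

The guard keeps the rule conservative: a value followed by a second unbracketed value before the "for" postfix falls back to the default proximity pairing.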

### Layer 3 architecture (per ADR 0006)

Audit validators now have a three-layer correctness model:

| Layer | Dimension | Mechanism | Release |
|---|---|---|---|
| 1 | Identity | `BindingKey` structured keys | v1.1.0 |
| 2 | Scope | Content-type + context-keyword filters (`scope='narrative'`) | v1.1.0 + v1.2.0 |
| **3** | **Pairing** | **Override / suppress proximity-based pairing under grammar cues** | **v1.3.0** |

Four Layer 3 rules ship in v1.3.0 (all under `scope='narrative'`, no new public kwargs):

- **Pattern A**: `"for {detector}"` postfix → override detector pairing (with intervening-value guard via v1.1.0 exclusion-ranges).
- **Pattern B**: `"{detector}'s"` possessive → override detector pairing (last possessive within 30 chars of value).
- **Pattern C**: `"for the {trained|frozen|baseline|all|both|other} detectors"` group subject → suppress candidate (with sentence-boundary guard via v1.2.0 sentence-positions).
- **Pattern D**: metric-axis nearest-pairing (symmetric to detector-axis). Pre-collects ALL metric positions across consumer-supplied `metric_aliases` keys, not just binding-derived metrics.
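
Patterns B and C above can be sketched in a few lines. This is a simplified illustration, not the shipped implementation: `possessive_detector`, `has_group_subject`, and `GROUP_SUBJECT_RE` are hypothetical names, only the straight apostrophe is handled, and the sentence-boundary guard is assumed to have been applied by the caller (the function receives a single sentence).

```python
import re
from typing import List, Optional

GROUP_SUBJECT_RE = re.compile(
    r"\bfor the (?:trained|frozen|baseline|all|both|other) detectors\b",
    re.IGNORECASE,
)


def possessive_detector(text: str, value_start: int, detectors: List[str],
                        window: int = 30) -> Optional[str]:
    """Pattern B: detector named by the last "{detector}'s" within `window`
    characters before the value, or None if there is no such possessive."""
    lo = max(0, value_start - window)
    before = text[lo:value_start]
    best_name, best_pos = None, -1
    for name in detectors:
        for m in re.finditer(re.escape(name) + r"'s", before, re.IGNORECASE):
            if m.start() > best_pos:
                best_name, best_pos = name, m.start()
    return best_name


def has_group_subject(sentence: str) -> bool:
    """Pattern C: True when the sentence attributes values to a detector group,
    in which case the candidate binding is suppressed rather than re-paired."""
    return GROUP_SUBJECT_RE.search(sentence) is not None


prose = "LoRA's pooled OOD AUROC is 0.383 against frozen probe's 0.515"
detectors = ["LoRA", "frozen probe", "TF-IDF + LR"]
print(possessive_detector(prose, prose.index("0.383"), detectors))  # LoRA
print(possessive_detector(prose, prose.index("0.515"), detectors))  # frozen probe
print(has_group_subject("0.38 AUROC, ~0.6 drop for the trained detectors"))  # True
```

Taking the last possessive in the window, rather than the first, is what lets 0.515 pair with "frozen probe" even though "LoRA's" opens the sentence.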

### Dogfood evidence (compounded across the four rounds)

| Release | Configuration | Warnings on consumer HEAD | Reduction vs v1.0.5 |
|---|---|---|---|
| v1.0.5 | Legacy 2-tuple, no scope filter | 95 | — (baseline) |
| v1.1.0 | `BindingKey` + `scope='narrative'` content-type filter | 23 | -76% |
| v1.2.0 | + T1–T4 context filters | 7 | -93% |
| **v1.3.0** | + Patterns A/B/C/D pairing rules | **0** | **-100%** |
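
The reduction column can be reproduced from the warning counts alone. A small sketch, assuming the percentages are rounded to the nearest integer against the v1.0.5 baseline (`reduction_pct` is a hypothetical helper, not part of eval-toolkit):

```python
def reduction_pct(baseline: int, current: int) -> int:
    """Percent change in warnings relative to the baseline, rounded."""
    return round(100 * (current - baseline) / baseline)


warnings_by_release = {"v1.0.5": 95, "v1.1.0": 23, "v1.2.0": 7, "v1.3.0": 0}
baseline = warnings_by_release["v1.0.5"]
for release, count in warnings_by_release.items():
    print(f"{release}: {count} warnings ({reduction_pct(baseline, count)}%)")
# v1.0.5: 95 warnings (0%)
# v1.1.0: 23 warnings (-76%)
# v1.2.0: 7 warnings (-93%)
# v1.3.0: 0 warnings (-100%)
```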

### Round 14 outcome

- ✅ **v1.3.0 shipped** (commit `4c2742e`, PyPI live).
- ✅ **#81 auto-closed** by `closes #81` directive on the v1.3.0 tag.
- ✅ **0 open issues** on the eval-toolkit repo post-#81.
- ✅ **ADR 0006 committed** — Layer 3 pairing rules formally codified
  as the third correctness layer.
- ✅ **Consumer HARD-gate promotion now credible.** With 0 residual
  warnings, the consumer can promote `audit_value_bindings` from
  SOFT to HARD bundled with `audit_citation_alignment` per their
  v1.3.8 plan.

### The post-v1.0 consumer-feedback cycle (R11 → R14)

The four post-v1.0 rounds form a complete consumer-feedback loop:

| Round | Driver | Cycle time | Closure |
|---|---|---|---|
| R11 | Consumer adopts v1.0.x audit-validator family | days | v1.0.4 (3 validators shipped) |
| R12 | Consumer files #80 (BINDINGS slice-axis) | ~2 hours | v1.1.0 + v1.2.0 (`BindingKey` + scope='narrative') |
| R13 | v1.1.0 dogfood surfaces context-filter gaps | ~1 hour | v1.2.0 (T1–T4 context filters; 23 → 7) |
| R14 | Consumer files #81 (cross-detector list-grammar) | ~1.5 hours | v1.3.0 (Layer 3 pairing rules; 4 → 0) |

The pattern is consistent: consumer surfaces a structural gap →
upstream files an ADR-driven design → ships within hours →
consumer adopts same-day. The 100% noise reduction over four
rounds validates the iterative consumer-feedback model as a
substitute for heavyweight multi-LLM gate audits (which closed at
v1.0 per ADR 0003).

### Multi-LLM audit cadence after R14

Unchanged from R11/R12/R13: no multi-LLM cross-review for
consumer-feedback rounds. R14's design was informed by
`/exploring-options` (2 rounds) + Pattern D emergence during dogfood
+ 3 tactical calibrations during implementation. Multi-LLM dispatch
remains reserved for v2.0 design cycles or major-severity findings.

### Cross-references

- [ADR 0006](adr/0006-pairing-rules-for-cross-detector-list-grammar.md) — Layer 3 codification.
- v1.3.0 CHANGELOG entry — full per-pattern detail with consumer adoption path.
- `tests/test_audit_value_bindings.py` — 43 tests (36 from v1.2.0 + 7 for Patterns A/B/C/D + backward-compat + combined dogfood).

---

## Round 15 (2026-05-26) — audit_citation_alignment Layer 2 + Layer 3 + ADR 0007 family-wide architecture

**Not a multi-LLM gate-style audit.** Fifth consumer-feedback-driven
round, opening the equivalent cycle for `audit_citation_alignment`
that R11→R14 closed for `audit_value_bindings`. Consumer filed
[eval-toolkit#82](https://github.com/brandon-behring/eval-toolkit/issues/82)
concurrent with their v1.3.13 adoption: 188 residual warnings on
`audit_citation_alignment`, the same architectural-class gap (Layer 2 +
Layer 3 context-awareness) that `audit_value_bindings` worked through
across three releases.

### Architectural insight: ADR 0007 generalizes the model

ADR 0005 (Layer 1 + 2) and ADR 0006 (Layer 3) were originally
framed as specific to `audit_value_bindings`. Adopting the same
architecture for `audit_citation_alignment` in the v1.4.0 cycle revealed that
the three-layer model is the canonical architecture for the
audit-validator family — not just one validator. ADR 0007 codifies
this family-wide.

The v1.4.0 release also extracted shared narrative-prose helpers
into private flat module `eval_toolkit/_narrative.py` (consistent
with ADR 0001's `_rng.py`/`_parallel.py`/`_sweep.py` precedent).
Both validators import from this module; future audit validators
inherit the same primitives.

### Three-layer model applied to audit_citation_alignment

| Pattern | Layer | Mechanism |
|---|---|---|
| **β** | Layer 2 | Exclude citations inside markdown table rows, bracketed expressions, fenced code blocks (reuses v1.1.0 `_build_exclusion_ranges`). |
| **γ** | Layer 3 | Category-keyword extraction window bounded by the SENTENCE containing the citation (reuses v1.2.0 `_sentence_boundary_positions`). |
| **α** | Layer 3 | Multi-category set membership: when the sentence matches multiple category keywords (multi-topic prose), accept the citation if the ADR's actual category is in the set. |
| **None-skip** | Layer 2 refinement | Defer (skip the citation) when `subject.category is None` — the consumer's category map can't classify the ADR. Symmetric to the existing `claim_category is None` skip from v1.0.1. |
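
How the γ, α, and None-skip rows combine can be illustrated with a short sketch. Everything here is an assumption made for illustration: `sentence_containing`, `check_citation`, the `"ok"`/`"warn"`/`"skip"` return values, and the keyword map are hypothetical and do not reflect the real `audit_citation_alignment` signature. The sentence splitter is deliberately naive (it would mis-split on abbreviations such as "e.g.").

```python
import re
from typing import Dict, List, Optional


def sentence_containing(text: str, pos: int) -> str:
    """Pattern γ: return the sentence that contains character offset `pos`."""
    start, end = 0, len(text)
    for m in re.finditer(r"[.!?]\s+", text):
        if m.end() <= pos:
            start = m.end()          # boundary before the citation
        elif m.start() >= pos:
            end = m.start() + 1      # first boundary after the citation
            break
    return text[start:end]


def check_citation(text: str, citation_pos: int, adr_category: Optional[str],
                   category_keywords: Dict[str, List[str]]) -> str:
    """Return "ok", "warn", or "skip" for one ADR citation."""
    if adr_category is None:
        return "skip"                # None-skip: the ADR cannot be classified
    sentence = sentence_containing(text, citation_pos).lower()
    claimed = {
        category
        for category, keywords in category_keywords.items()
        if any(keyword in sentence for keyword in keywords)
    }
    if not claimed:
        return "skip"                # no claim category found in the sentence
    # Pattern α: accept when the ADR's category is any of the matched ones.
    return "ok" if adr_category in claimed else "warn"


keywords = {"threshold": ["threshold", "cutoff"], "cost": ["cost", "budget"]}
multi = "The threshold and cost trade-off follows ADR-025."
single = "The threshold is fixed per ADR-025."
print(check_citation(multi, multi.index("ADR-025"), "cost", keywords))    # ok
print(check_citation(single, single.index("ADR-025"), "cost", keywords))  # warn
print(check_citation(single, single.index("ADR-025"), None, keywords))    # skip
```

For a sentence that matches a single category the set has one element, so the membership test reduces to a plain equality check.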

### Upstream design — `/exploring-options` 3 rounds + Pattern α generalization + None-skip emerged during dogfood

| Time (UTC) | Event |
|---|---|
| ~21:30Z | Consumer files #82 documenting 3 prose patterns (α/β/γ) + 188-warning baseline. |
| ~22:00Z | `/exploring-options` Round 1: scope. Decision: bundle Path A + Path B in v1.4.0 (one coherent release; patterns established). |
| ~22:10Z | `/exploring-options` Round 2: helper sharing. Decision: extract to private flat module `_narrative.py`. |
| ~22:20Z | `/exploring-options` Round 3: documentation. Decision: write ADR 0007 generalizing three-layer model to family. |
| ~22:30Z | Implementation reaches first dogfood: 188 → 101 (46% reduction). Pattern β + γ + initial α (multi-citation only) firing. |
| ~22:40Z | **Symmetric-None skip emerges**: dominant residual is `actual=None` cases (ADRs the consumer's category map can't classify). Symmetric to existing `claim=None` skip from v1.0.1. Added under `scope='narrative'` → 101 → 44. |
| ~22:50Z | **Pattern α generalized**: extend from "multi-citation sentences only" to "any multi-topic sentence" (any sentence matching ≥2 category keywords). 44 → 37. |
| ~22:55Z | Decision: ship at 37 rather than over-engineer. Above the #82 ≤20 target but a 5× reduction; residual 37 includes real misalignments worth consumer-triage + edge cases requiring parser-level understanding. |
| ~23:00Z | **v1.4.0 ships** — commit `7c40490`. ADR 0007 codifies family-wide architecture. |

End-to-end from #82 filing to v1.4.0 ship: ~1.5 hours.

### Dogfood evidence (audit_citation_alignment cycle)

| Release | Configuration | Warnings on consumer HEAD | Reduction vs v1.3.0 baseline |
|---|---|---|---|
| v1.3.0 | `audit_citation_alignment` with `scope='all'` | 188 | — (baseline) |
| **v1.4.0** | `scope='narrative'` | **37** | **-80%** |

The residual 37 is above the original #82 acceptance criterion
(≤20) but represents a 5× reduction. The residuals fall into three
categories:

- **Real misalignments** consumer should triage (e.g., `ADR-025`
  cited for a threshold claim when ADR-025 is the cost ADR —
  could be a wrong-ADR bug or a multi-topic ADR not captured by
  the consumer's category-keyword map).
- **Single-topic-sentence edge cases**: Pattern α only fires for
  multi-topic sentences (≥2 category keywords matched). Sentences
  with one dominant topic + a cross-category ADR cite stay on the
  legacy first-match check.
- **Multi-topic ADRs** where the consumer's category map assigns
  a single category but the ADR genuinely covers multiple
  topics. Consumer-side category-map expansion would address
  these.

### Round 15 outcome

- ✅ **v1.4.0 shipped** (commit `7c40490`, PyPI live).
- ✅ **#82 auto-closed** by `closes #82` directive on the v1.4.0 tag.
- ✅ **0 open issues** on the eval-toolkit repo post-#82.
- ✅ **ADR 0007 committed** — three-layer architecture (identity +
  scope + pairing) codified as canonical for ALL audit_*
  validators. ADRs 0005/0006 now framed as family-wide via 0007.
- ✅ **`_narrative.py` extracted** — shared narrative-prose
  helpers; both validators import from one canonical location.
  Signature-preserving refactor; all v1.3.0 tests pass unchanged.
- ⚠ **Consumer HARD-gate promotion**: still a judgment call. The
  residual 37 includes some real misalignments worth fixing in
  consumer prose; HARD-gating now would block commits on those
  lines. Consumer's v1.3.8 bundled-promotion plan should
  proceed only after triaging the 37.

### The post-v1.0 consumer-feedback cycle (R11 → R15)

The five rounds form a complete library-first cycle for the
audit-validator family:

| Round | Driver | Cycle time | Closure |
|---|---|---|---|
| R11 | Consumer adopts v1.0.x audit-validator family | days | v1.0.4 (3 validators shipped) |
| R12 | Consumer files #80 (BINDINGS slice-axis) | ~2h | v1.1.0 + v1.2.0 |
| R13 | v1.1.0 dogfood surfaces context-filter gaps | ~1h | v1.2.0 |
| R14 | Consumer files #81 (audit_value_bindings list-grammar) | ~1.5h | v1.3.0 (Layer 3) |
| **R15** (this) | **Consumer files #82 (audit_citation_alignment scope+pairing)** | ~1.5h | **v1.4.0 (Layer 2 + 3 + ADR 0007)** |

R11→R15 demonstrates the iterative consumer-feedback model as a
substitute for heavyweight multi-LLM gate audits (closed at v1.0
per ADR 0003). Each round delivered closure within hours of the
consumer filing, with same-day adoption. ADR 0007 captures the
architectural template; future audit validators inherit the
three-layer model and ship in fewer cycles.

### Multi-LLM audit cadence after R15

Unchanged from R11–R14: no multi-LLM cross-review for
consumer-feedback rounds. R15's design was informed by
`/exploring-options` (3 rounds) + 2 dogfood-driven refinements
(Pattern α generalization + symmetric-None skip). Multi-LLM
dispatch remains reserved for v2.0 design cycles or
major-severity findings.

### Cross-references

- [ADR 0007](adr/0007-three-layer-architecture-for-audit-validators.md) — family-wide three-layer codification.
- v1.4.0 CHANGELOG entry — full per-pattern detail with consumer adoption path.
- `tests/test_audit_citation_alignment.py` — 18 tests (12 from v1.0.1 + 6 for v1.4.0 Patterns α/β/γ + backward-compat + shared-helpers + combined dogfood).

---