# Audit findings ledger

This ledger tracks findings from each cross-model methodology audit (Gate 3 per [ADR 0003](adr/), to be drafted at v0.48). Each row records the finding ID, severity, finding summary, disposition (how it was resolved or scheduled), and a link to the tracked issue (where applicable).

**Convention**: blocker-severity findings get a `p1-gate3`-labelled GitHub issue for fix-tracking. Lower-severity findings are recorded here only.

**Cross-references**:

- The audit prompt template is at [`gate3-audit-prompt.md`](https://github.com/brandon-behring/eval-toolkit/blob/main/.claude/plans/gate3-audit-prompt.md) (local; not in published docs).
- The v1.0 plan that drives audit cadence is at `~/.claude/plans/evaluate-all-the-work-twinkly-kite.md` (local).
- Audit re-run schedule: after each breaking minor (v0.46, v0.47, v0.48) plus the original Round 5 pre-implementation pass. 7-day audit-completion timeout per gate.

---

## Round 5 (2026-05-21): Codex + Gemini pre-implementation audit

**Reviewers**: author (manual) + Codex (independent report) + Gemini (independent report).

**Packet**: v0.44.0 code state + the v1.0 release plan (`~/.claude/plans/evaluate-all-the-work-twinkly-kite.md`) + `docs/source/methodology/` (16 chapters) + `docs/source/roadmap.md` + `CHANGELOG.md` + `src/eval_toolkit/*.py` + existing migration guides (`migration/v0.7.md`, `v0.8.md`, `v0.9.md`).

**Audit prompt**: [`~/.claude/plans/gate3-audit-prompt.md`](https://github.com/brandon-behring/eval-toolkit/blob/main/.claude/plans/gate3-audit-prompt.md).
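
The "~63.2% overlap" figure quoted for R5-F6e in the table that follows comes from a general property of bootstrap resampling: a size-`n` resample drawn with replacement touches roughly `1 - 1/e ≈ 0.632` of the original rows, so an independent second resample of the same array lands on an already-used row about that often. The sketch below is only an illustration of that arithmetic and of an identity guard of the kind the disposition describes. `resample_overlap` and `require_distinct_arrays` are hypothetical names chosen here, not functions from `eval_toolkit`.

```python
import random


def resample_overlap(n: int, seed: int = 0) -> float:
    """Fraction of test-resample draws whose source row also appears in an
    independent val resample taken from the same n-row array."""
    rng = random.Random(seed)
    # Unique source rows touched by the "val" bootstrap resample.
    val_rows = {rng.randrange(n) for _ in range(n)}
    # Independent "test" bootstrap resample over the same rows.
    test_draws = [rng.randrange(n) for _ in range(n)]
    return sum(i in val_rows for i in test_draws) / n


def require_distinct_arrays(val_y, test_y) -> None:
    """Hypothetical guard (illustration only): reject the same object
    being passed as both the validation and the test labels."""
    if val_y is test_y:
        raise ValueError("val_y and test_y are the same object; use a disjoint split")


overlap = resample_overlap(10_000)
print(f"empirical overlap: {overlap:.3f}")
```

An identity check like this only catches the literal same-object case; two separate copies of the same data would pass it, which is why the disposition also rewrites the documentation example to use a disjoint split.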

| ID | Severity | Finding | Disposition | Issue |
|----|----------|---------|-------------|-------|
| R5-F1 | blocker before v0.46 | `scorecard()` first-party metric list mixes threshold-free + threshold-dependent metrics; `MetricSpec.compute(y, s) -> float` has no threshold input | **Decision R**: drop F1/accuracy/precision/recall from v0.46 specs; keep `metrics_at_threshold` + `ThresholdSelector` as separate path | _(in plan)_ |
| R5-F2 | blocker before v0.46 | `Scorecard` result type has no contract for undefined/skipped/error cells; existing `MetricState` infra unused | **Decision S**: reuse `MetricState` (`ok/skipped/error`) vocabulary from `artifacts.py:30-61` | _(in plan)_ |
| R5-F3 | blocker before v0.47 | Sweep unification plan assumes `DelimitVariant`/`DatamarkVariant` classes exist; `preprocessing.py` exports only functions | Plan revised: create 3 preprocessing dataclasses at v0.47 as part of sweep work; sweep contract clarified (neutral default; explicit `attack_threshold` required for ASR) | _(in plan)_ |
| R5-F4 | blocker before v0.46 | Plan's scalar-deprecation directive would replace the lazy export resolver (`__init__.py:302-312`), shattering all root imports | Plan corrected: extend existing `__getattr__` with a transitional deprecation branch (BEGIN/END markers); v0.47 removes only that branch, base resolver intact | _(in plan)_ |
| R5-F5 | blocker before v1.0 | DeLong (`DeLongResult`, `delong_roc_variance`) is publicly exported + in API docs, but methodology + roadmap docs say "out of scope" | **Decision U**: keep public; update `methodology/comparison.md`, `methodology/reading_list.md`, `roadmap.md` to align with shipped state. Bundled into v0.48 §5E-prep | _(in plan)_ |
| R5-F6a | packet drift | `cv_clt_ci` docstring claims "Bayle et al. 2020 prove a CV-CLT with a correction factor"; code does naive sample variance (no scalar correction) | Docstring fix bundled into v0.48 §5E-prep. Code is correct per Bayle (2020) Thm 3.1; docstring oversells | _(in plan)_ |
| R5-F6b | packet drift | `methodology/parallelism.md:143-181` says "as of v0.34, harness not yet parallelized" + "once #29/#30 land", but v0.36 wired `evaluate(n_jobs=)` + `evaluate_folded(n_jobs=)` | Doc fix bundled into v0.48 §5E-prep. Also clarify `bootstrap_ci`'s `n_jobs` is studentized-only | _(in plan)_ |
| R5-F6c | packet drift | `methodology/testing.md:108-136` says reference-equivalence gap "closing in PR 1.5"; roadmap state shows it's closed | Doc fix bundled into v0.48 §5E-prep | _(in plan)_ |
| R5-F6d | packet drift | `methodology/calibration.md:15-18` lists only temperature/isotonic/Platt; Beta + 4-binary-adapter family also shipped | Doc fix bundled into v0.48 §5E-prep | _(in plan)_ |
| R5-F6e | packet drift + code | `methodology/bootstrap.md` two-level example uses same array for val + test, causing ~63.2% overlap when `paired_bootstrap_op_point_diff` resamples val/test independently | Doc fix + defensive code guard bundled into v0.48 §5E-prep: rewrite example with disjoint split + raise `ValueError` in `paired_bootstrap_op_point_diff` if `val_y is test_y` | _(in plan)_ |
| R5-F6f | partial verified | `CostSensitiveSelector` formula `t* = c_FP·(1-π) / (c_FP·(1-π) + c_FN·π)` is the prior-corrected form; existing docstring already cites Elkan's prior-independent variant. Intentional design, but easy to misuse on already-calibrated probabilities | Docstring sharpening (add `Warning` admonition) bundled into v0.48 §5E-prep. Math itself is correct per the documented intent | _(in plan)_ |
| R5-F7 | governance risk | Gate 3 redefinition (multi-model cross-review) is useful but not the same evidence class as external academic peer review; the plan understated this | **Decision O revised**: ADR 0003 explicitly states Gate 3 at v1.0 is internal model-assisted cross-review, NOT external academic peer review; documents what it catches and doesn't | _(in plan)_ |
| R5-Px | plan internal inconsistency | v0.46 verification smoke imports `TextTransform`, but Decision K introduces it at v0.47 | Plan smoke test corrected: `TextTransform` import removed from v0.46 smoke | _(in plan)_ |

**Round 5 totals**: 12 findings (5 blockers, 1 high-pre-v1, 5 packet drift, 1 governance, 1 plan inconsistency). All resolved or scheduled before any v0.46 code shipped.

**Key follow-on decisions** (driven by Round 5):

- **Decision W**: packet-drift fixes (F6a–F6f) bundle into v0.48 polish (not earlier hotfix).
- **Decision X**: scorecard skipped-status detection reuses `is_metric_defined_for_slice` from v0.39.0.
- **Decision Y**: Round 6 / 7 / 8 audits run as formal stop-gates between v0.46 / v0.47 / v0.48 / v1.0; 7-day timeout per gate.
- **Decision Z**: this ledger file.

---

## Round 6 (complete: 2026-05-21; Codex + Gemini + manual; 11 findings)

**Ship date**: v0.46.0 tagged + published to PyPI 2026-05-21. STOP-GATE per Decision Y.2: `release/v0.47.0` opens once R6-F1 (BLOCKER) + R6-F2 (HIGH) are fixed via v0.46.1 hotfix and the remaining items are scheduled.

**Reviewers**: author (manual) + Codex (independent) + Gemini (independent).
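
The `__getattr__` deprecation branch from R5-F4 is one of the regions Round 6 reviewers are asked to re-examine, so a small illustration of the general pattern may help when reading the packet. The sketch below shows a module-level `__getattr__` (PEP 562) with a clearly delimited transitional branch layered on top of an unchanged base resolver. It is an assumption-laden sketch, not the code in `src/eval_toolkit/__init__.py`: the contents of `_EXPORTS` and `_DEPRECATED_SCALARS`, the warning text, and the throwaway `demo_pkg` module are all invented here for illustration.

```python
import types
import warnings

# Hypothetical stand-ins: the real export table and deprecated-name set
# live in the package's __init__.py and are not reproduced here.
_EXPORTS = {"scorecard": lambda: "scorecard-result", "pr_auc": lambda: 0.5}
_DEPRECATED_SCALARS = {"pr_auc"}


def _module_getattr(name: str):
    """Lazy resolver: base lookup plus a removable deprecation branch."""
    # BEGIN transitional deprecation branch (the only part a later release deletes)
    if name in _DEPRECATED_SCALARS:
        warnings.warn(
            f"{name} is deprecated at the package root; use scorecard() instead",
            DeprecationWarning,
            stacklevel=2,
        )
    # END transitional deprecation branch
    try:
        return _EXPORTS[name]
    except KeyError:
        raise AttributeError(f"module has no attribute {name!r}") from None


# Build a throwaway module so the sketch runs without touching a real package.
demo = types.ModuleType("demo_pkg")
demo.__getattr__ = _module_getattr
```

Keeping the deprecation logic between explicit markers means its removal is a pure deletion: names outside `_DEPRECATED_SCALARS` never enter the branch, so the base resolver behaves the same before and after.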

**Packet to send to Codex + Gemini**:

- This plan file (`~/.claude/plans/evaluate-all-the-work-twinkly-kite.md`)
- This audit ledger (`docs/source/audit_findings.md`)
- `docs/source/methodology/` (16 chapters, unchanged from Round 5)
- `docs/source/roadmap.md`
- `docs/source/migration/v0.46.md` (**NEW**: consumer migration guide)
- `docs/source/adr/0002-scorecard-as-primary-metric-surface.md` (**NEW**)
- `CHANGELOG.md` (new v0.45.0 + v0.46.0 entries since Round 5)
- Source tree, with focus on the v0.45 + v0.46 additions:
  - `src/eval_toolkit/stacking.py` (v0.45, ~370 lines)
  - `src/eval_toolkit/scorecards.py` (renamed from `_scorecard.py` at v0.49.0; ~585 lines)
  - `src/eval_toolkit/metric_specs.py` (v0.46, ~180 lines)
  - `src/eval_toolkit/__init__.py`: `__getattr__` deprecation branch + `_DEPRECATED_SCALARS` (Audit F4 invariant region)
  - `src/eval_toolkit/metrics.py:120-200`: `SINGLE_CLASS_INCOMPATIBLE_METRICS` extension (Round-5 X.2 precondition)

**Audit prompt**: [`~/.claude/plans/gate3-audit-prompt.md`](https://github.com/brandon-behring/eval-toolkit/blob/main/.claude/plans/gate3-audit-prompt.md) (local). The "Known issues already in the v0.48 backlog (skip re-reporting)" section already lists drift items scheduled for v0.48 polish; Round 6 reviewers should skip those and surface only NEW findings against the v0.46 state.

**Focus areas** for Round 6 review:

- **scorecard surface design lock-in.** The Tier-2 `MetricSpec` Protocol freezes at v1.0; method-signature changes require a v2.0 major bump. Last cheap chance to catch contract gaps.
- **MetricResult cell-state contract**: does the `ok` / `skipped` / `error` vocabulary cover every relevant failure mode? Are the reason strings useful for triage?
- **Per-cell error isolation**: confirm that catching all exceptions in `_evaluate_spec` doesn't hide important failures the user should see.
- **`__getattr__` deprecation shim** (Audit F4 invariant): does the branch correctly route deprecated names, NOT break non-deprecated resolution, and cleanly delete at v0.47?
- **Spec name encoding for parameterized metrics**: is `"ece_n_bins_15_strategy_uniform"` a stable v1.0 commitment, or does the alphabetize-kwargs rule create surprise keys for custom user specs with multi-kwarg signatures?
- **`Scorecard.to_pandas()` MultiIndex schema**: first-time-public; any shape lock-in concerns?

**Triage on findings**: each blocker → `p1-gate3`-labelled GitHub issue + a row in this ledger. Either fix-as-v0.46.1-hotfix or fold into v0.47 design (per Decision Q severity-tiered hotfix policy).

| ID | Reviewer | Severity | Finding | Disposition | Lands |
|----|----------|----------|---------|-------------|-------|
| R6-F1 | Codex | **BLOCKER** before v0.47 opens | `metric_specs.ece(strategy="typo")` silently dispatches to quantile ECE and returns scorecard cell with `status="ok"` under invalid key (`"ece_n_bins_15_strategy_typo"`). Wrong-by-design data correctness path. Verified via Codex runtime probe. | Add strategy validation in `ece()` factory + `_EceSpec.compute()`; raise `ValueError("ECE strategy must be 'uniform' or 'quantile'; got {strategy!r}")` (plan §2.5A). | **RESOLVED v0.46.1** (commit `7a4bb14`, tag `v0.46.1` 2026-05-21; consumer pin bumped same day) |
| R6-F2 | Codex + Gemini | HIGH before v0.47 scalar hard-removal | ECE deprecation warnings in `__init__.py:_scorecard_spec_for()` emit broken migration snippets for all 5 ECE variants. Two-part bug: (a) for the 2 variants in `metric_specs`, the suggested scorecard key uses the factory-call expression (`"ece(n_bins=10)"`) instead of the encoded spec name (`"ece_n_bins_10_strategy_uniform"`); (b) for the 3 variants NOT in `metric_specs` (`_debiased`, `_l2`, `_l2_debiased`), the fallback name isn't an importable spec. Gemini claimed pre-v0.46 default was `n_bins=15` (verified incorrect: code at `metrics.py:730-734` shows `n_bins=10`); Decision R6-F resolves: warning uses `n_bins=10` to preserve pre-v0.46 math + adds migration note about new factory default. | Restructure `_scorecard_spec_for()` to return `(factory_expr, scorecard_key, has_first_party)` tuple; correct snippets for first-party variants with `n_bins=10`; submodule-path template for 3 non-first-party variants per Decision R6-G (plan §2.5B). | **RESOLVED v0.46.1** (commit `7a4bb14`, tag `v0.46.1` 2026-05-21; consumer-side smoke verified all 5 ECE-variant warnings + submodule-path routing) |
| R6-F3 | Codex | HIGH before scorecard freeze | Duplicate `MetricSpec.name` values in the same `scorecard()` call silently overwrite earlier cells (last-wins). Not a documented contract. | Decision R6-B (locked): reject in `scorecard()` with `ValueError("Duplicate MetricSpec name 'X' at index N; ...")`. Forces caller to disambiguate; no silent data loss. (Plan §4G.) | **v0.47** |
| R6-F4 (= Gemini R6-F1) | Codex + Gemini | HIGH before v1.0 | `scorecard(seed=None)` documented as non-deterministic; implementation coerces `None → 0`. Doc/impl contradiction. Verified by Codex via bit-for-bit equality test. | Decision R6-A (locked): deterministic-by-default; fix docs only. No behavior change. Plan §4G-prep. (Decision R6-E: rolls to v0.47; R6-A is non-blocker per Decision Q's "docstring" category.) | **v0.47** |
| R6-F5 | Codex | Contract-enforcement gap before v1.0 | ADR 0003 promises strict Tier-2 Protocol method-shape stability; current public-API drift guard only snapshots `(*args, **kwargs)` for Protocol classes, not method signatures. The guard does not see changes to `MetricSpec.compute`, `MetaLearner.fit`, etc. | Decision R6-D (locked): extend `tests/test_public_api.py` snapshot to capture Protocol method signatures via `inspect.signature` + `typing.get_type_hints` for the 9 Tier-2 Protocols. (Plan §4I.) | **v0.47** |
| R6-F6 | Codex | Packet drift | v1.0 plan + roadmap still describe pre-v0.46 scorecard shapes that didn't ship: `ece_n_bins_15` without strategy in plan, `ece_quantile()` factory listed (shipped as `ece(strategy='quantile')`), `MetricUndefinedError` mentioned (ADR 0002 chose no new public exception), `n_resamples >= 100` floor (shipped is `>= 1`). Roadmap "Currently shipped" still says v0.44. | Plan §4L: refresh plan §3A scorecard examples + roadmap shipped-state section. Doc-only commit on v0.47 release branch. | **v0.47** |
| R6-F3 (Gemini) | Gemini | MEDIUM (schema lock-in before v1.0) | `Scorecard.to_pandas()` MultiIndex columns expose `value, status, reason, ci_low, ci_high, confidence` but drop `n_resamples` + `method` from `BootstrapCI`. Provenance loss compared to `to_dict()`. v1.0 is about to lock the schema. | Decision R6-C (locked): add `n_resamples` + `method` columns at v0.47 (additive). Schema becomes lossless against `to_dict()`. (Plan §4H.) | **v0.47** |
| R6-F4 (Gemini) | Gemini | LOW | `MetricSpec` Protocol doesn't enforce stable parameterized-spec naming. Custom users implementing multi-kwarg parameterized specs can silently spawn distinct dict keys if constructor arg order varies. | Decision R6-H (locked): add `make_spec_name(prefix, **kwargs)` canonicalization helper in `metric_specs.__all__` only (NOT top-level `_EXPORTS`; Tier-2 additive contract). Alphabetized kwargs, snake_cased, joined by underscore. (Plan §4J.) | **v0.47** |
| R6-F5 (Gemini) | Gemini | LOW | `_evaluate_spec()` wraps `spec.compute()` in broad `except Exception`. Swallows `MemoryError`, `RecursionError`, `KeyboardInterrupt`, `SystemExit` into cell state: process exhaustion / user-interrupt signals get hidden as metric errors. | Narrow exception catch: `except (MemoryError, RecursionError, KeyboardInterrupt, SystemExit): raise` first, then existing broad catch. (Plan §4K.) | **v0.47** |

**Round 6 totals**: 11 findings (Codex 6 + Gemini 5; 2 overlap on `seed=None` + ECE deprecation snippets but with different reasoning angles). 1 BLOCKER (R6-F1) + 5 HIGH + 2 MEDIUM/contract + 3 LOW. All dispositioned to either v0.46.1 (2 fixes) or v0.47 (9 fixes).

**Key follow-on decisions** (driven by Round 6; locked in plan):

- **Decision R6-A**: `seed=None` deterministic-by-default; fix docs only.
- **Decision R6-B**: Reject duplicate `MetricSpec.name` with `ValueError`.
- **Decision R6-C**: Add `n_resamples` + `method` to `to_pandas()` schema.
- **Decision R6-D**: Extend public-API snapshot to cover Protocol method signatures.
- **Decision R6-E**: v0.46.1 scope = R6-F1 + R6-F2 only; R6-A rolls to v0.47 (non-blocker per Decision Q's "docstring" category).
- **Decision R6-F**: Use `n_bins=10` (pre-v0.46 default) in deprecation warnings + migration note about new v0.46+ factory default of `n_bins=15`. Corrects Gemini's misverified pre-v0.46 default claim.
- **Decision R6-G**: 3 ECE variants without `metric_specs` (debiased, l2, l2_debiased) route deprecation warnings to submodule path; do NOT add to `metric_specs` at v0.47.
- **Decision R6-H**: `make_spec_name()` helper in `metric_specs` submodule only; not top-level.

### Round 6 v0.46.1 ship status (2026-05-21)

- **R6-F1** ✅ SHIPPED in v0.46.1 (PR #67, squash `7a4bb14`). End-to-end verified in consumer: `ms.ece(strategy="typo")` raises `ValueError`; direct `_EceSpec(strategy=...)` construction also raises (defence-in-depth).
- **R6-F2** ✅ SHIPPED in v0.46.1 (PR #67, squash `7a4bb14`). End-to-end verified in consumer: `eval_toolkit.expected_calibration_error` warning carries `ece(n_bins=10)` + key `ece_n_bins_10_strategy_uniform` + migration note about v0.46+ `n_bins=15` default. All 3 non-first-party variants (debiased, l2, l2_debiased) route to `from eval_toolkit.metrics import …` submodule path.
- **9 other Round 6 items** (R6-A docstring, R6-B duplicate-name guard, R6-C to_pandas schema, R6-D Protocol method-shape snapshot, R6-F4-Gemini `make_spec_name`, R6-F5-Gemini narrow `except`, R6-F6 plan/roadmap state-drift) → folded into `release/v0.47.0` per Decision R6-E.
- **Round 6 STOP-GATE status**: CLOSED. `release/v0.47.0` can open after the v0.46.1 consumer cycle observation completes (1 cycle).

---

## Round 7 (complete: 2026-05-21; Codex + Gemini; 3 substantive findings)

**Reviewers**: author (manual) + Codex (independent report) + Gemini (independent report).

**Packet**: v0.47.0 code state + the v1.0 plan + `docs/source/methodology/` (16 chapters) + ADRs 0001/0002/0003 + `docs/source/migration/v0.46.md` + `docs/source/migration/v0.47.md` + Round 5/6 ledger.

**Round-7 briefing**: `gate3-audit-round-7.md` (committed `a9e1114`).

**Reports**: `gate3-audit-round-7-codex-report.md` + `gate3-audit-round-7-gemini-report.md` (untracked per `.gitignore`).

**Headline**: Codex 3 substantive findings; Gemini 0. Overlap was zero between the two reports; the most consequential finding (R7-F1 doc-migration boundary gap between Sybil-tested fences and MyST-NB-executed example notebooks) was Codex-only. Reinforces the Round 6 pattern (do not use overlap as a confidence floor; single-reviewer findings can be the most critical).

| ID | Reviewer | Severity (their words) | Finding | Disposition | Lands |
|----|----------|------------------------|---------|-------------|-------|
| R7-F1 | Codex | high before v0.48 | v0.47 doc migration missed MyST-NB executable example notebooks (separate from Sybil-collected `.md` fences). 6 example pages + 4 module-level docstrings + `protocols.md` autosummary + roadmap wording still reference removed APIs. Docs CI runs `sphinx-build` without `-W`, so notebook execution failures pass as advisory warnings. Verified via `sphinx-build` runtime probe: 6 execution failures buried in the warning stream. | Decision R7-A (locked at /exploring-options Q3): bundle into v0.48 §5G/§5H. §5G migrates the 6 notebooks + 4 docstrings + autosummary + roadmap; §5H enables `nb_execution_raise_on_error = True` in `conf.py`. Audit-as-seed expansion (Q2 locked full sweep) covered ALL module docstrings + drift in 5 existing `api/*.md` autosummary lists + 8 missing `api/*.md` pages. | **RESOLVED v0.48.0** (§5G commit `e07db16` + §5H commit `6349472` on `release/v0.48.0`) |
| R7-F2 | Codex | high before sweep freezes | `sweep()` records only `strategy.name` per row; two configured instances of same dataclass (e.g., `DelimitVariant(delimiter="<<")` + `DelimitVariant(delimiter="[[")`) silently merge under `groupby("variant")`. Style-coherent defect class with Round 6 R6-F3 (scorecard duplicate name) but with different semantics (row container vs. Mapping). | Decision R7-B option C (locked): emit `strategy_id` canonical column AND reject duplicate `strategy_id` at sweep boundary. Style invariants 1 (no silent failures) + 2 (natural call pattern is right) + 4 (canonical identifier + reject in canonical dimension) read together. | **RESOLVED v0.48.0** (§5I commit `f454afe`) |
| R7-F3 | Codex | worth fixing before v1.0 | `sweep()` doesn't validate scorer output cardinality. Three failure modes via runtime probe: overlong 1-D → silent truncation (worst); short 1-D → IndexError later; (n,2) matrix → TypeError when `float()` applied. | Decision R7-C (locked): API-level `ValueError` with contextual label at the sweep boundary; replaces all three low-level failure modes. Style invariants 1 + 3. | **RESOLVED v0.48.0** (§5J commit `fcf99f0`) |

### Gemini observations (Round 7)

Gemini's report verdict was "highly stable; release/v0.48.0 is safe to open." Six minor observations / validations; nothing critical that Codex hadn't covered.
The actionable items folded into v0.48:

- §1-3 + 5-7: VALIDATIONS of v0.47 shipped state (`TextTransform` shape, shim removal, sweep design, R6-D Protocol method-shape snapshot, ADR 0003 tiers). No action needed.
- §4 (pedagogical drift): Gemini noted "from eval_toolkit.metrics import pr_auc" is syntactically green but slightly undermines ADR 0002. v0.48 §5G migration explicitly chose `scorecard()` for example notebooks teaching METRIC USAGE; submodule path only where teaching the underlying math.
- §4 (Makefile pre-push): Gemini recommended hardening to prevent the `pytest tests/` path-override trap. Landed as v0.48 §5L (`make pre-push` target running all 3 doc-execution surfaces; commit `9878a54`).
- §5 (R6-C dtype coercion): Gemini noted `n_resamples` (int + NaN) → `float64` is an accepted tradeoff. Landed as v0.48 §5K (Notes section on `Scorecard.to_pandas()` docstring; commit `6304cea`).
- §6 (SynonymSubstitution whitelist): Gemini recommended adding a docstring note about the hardcoded 6-entry whitelist. Landed as v0.48 §5K (`adversarial.py` Notes section; commit `6304cea`).

### Audit-as-seed extensions (v0.48)

Per user direction during plan refinement ("use the audits as seeds for things to reconsider"), the Round 7 findings + style-invariants framing surfaced additional v0.48 scope beyond Codex's explicit list:

- **§5G expansion**: from 4 Codex-flagged module docstrings to full sweep across `src/eval_toolkit/` module docstrings + audit of all `docs/source/api/*.md` autosummary pages. Found 8 missing API pages + 5 drifted autosummary lists.
- **§5M new**: in-source docstring drift audit (third doc-execution surface). Result: 82 PASS / 1 skipped / 0 fail; expanded `.doctest-modules` from 11 → 21 modules so CI catches future drift.
- **§5N comprehensive**: cross-API shape-validation consistency sweep beyond Codex's R7-F3 target. Audited `metrics_at_threshold`, `paired_bootstrap_op_point_diff`, `bootstrap_metric_from_predictions`, `metrics.py` scalars, `fit_*_binary` / `fit_*_calibrator`. Tightening commit landed for `metrics_at_threshold` silent threshold semantics (commit `76773dc`); `paired_bootstrap_op_point_diff` `val_y is test_y` guard landed as part of §5E-prep code-side fix (commit `5c8e68d`).

### Round 7 ship status

- **3 substantive Codex findings**: all RESOLVED in v0.48.0 via §5G + §5H + §5I + §5J.
- **6 Gemini observations**: all RESOLVED in v0.48.0 via §5G + §5K + §5L.
- **Audit-as-seed extensions** (§5G expanded, §5M new, §5N comprehensive): all RESOLVED in v0.48.0.
- **Round 7 STOP-GATE status**: CLOSED via v0.48.0 release. Round 8 audit STOP-GATE per Decision Y.2 opens against the v0.48.0 state before `v1.0.0` tag can land.

---

## Round 8 (2026-05-24): STOP-GATE before v1.0 tag, rectified in v0.51.0

Multi-LLM cross-review fired against the v0.50.0 state. **Codex** returned 10 findings (3 with reproducible probes: C3, C4(a)/(b), C9); **Gemini** returned 5 findings + 3 positive validations (V1 style, V2 methodology, V3 scorer_error safety). Independent verification by Claude (`audit-verification-codex-gemini-v0.50.0.md`, 510 lines, untracked per `.gitignore`) confirmed 13 of 18 items, refuted 3 (R8-G2, R8-G5, R8-V1+R8-V2 paired-as-one over-confident Gemini validations), and deferred 2 (R8-G3, R8-G4) to v1.x as Tier-2 additive.

The v0.51.0 release ships fixes for all 13 confirmed items; 2 (G3, G4) are deferred to v1.x as Tier-2 additive; 3 refuted entries are recorded below for the audit trail. Round 9 multi-LLM cross-review runs against the v0.51 RC before v1.0 tags.

### Confirmed (13): fixed in v0.51.0

| ID | Severity | Finding | Disposition | Commit |
|----|----------|---------|-------------|--------|
| R8-C3 | blocker (escalated from Codex fix-rec) | `recall_at_fpr` fallback violated FPR ceiling (`actual_fpr=1.0` when `target_fpr=0.0`); probe `y=[0,1], scores=[1.0,1.0]`. | v0.51 sentinel `threshold=np.inf, actual_fpr=0.0, fp=0`. | `672d45f` |
| R8-C4(a) | blocker (escalated) | Generator-rng not parallel-stable in `_score_all_slices`; n_jobs=1 vs n_jobs=2 produced different CIs; SPEC 7 contract violation. | v0.51 spawns child SeedSequences at the dispatch boundary. | `61964f6` |
| R8-C4(b) | blocker (escalated) | `spawn_seed_sequences` ignored Generator state (extracted `bit_generator.seed_seq`). | v0.51 draws fresh entropy via `rng.integers(...)`. | `87453f6` |
| R8-C1 | fix-recommended | `evaluate_folded(seeds=)` reused the splitter across the seed loop, so partitions never varied. | v0.51 adds `reseed_splitter` callback + DeprecationWarning when multi-seed + None. Warning persists past v1.0. | `f60d43c` |
| R8-C2 | fix-recommended | `SourceDisjointKFoldSplitter.iter_folds` yielded empty test partitions when k > n_sources; `get_n_splits` returned `min(k, n_sources)`. | v0.51 caps iter_folds at `min(k, n_sources)` + UserWarning. | `f60d43c` |
| R8-C5 | fix-recommended | README links to `docs/...` (broken); migration toctree listed only v0.7–v0.9 despite v0.49–v0.51 BREAKING releases. | v0.51 repoints all README links to `docs/source/...`; toctree + MIGRATION.md index extended through v0.51; v0.49/v0.50/v0.51 migration guides authored. | `c206b54` |
| R8-C6 | fix-recommended | `calibration.reliability_curve`, `maximum_calibration_error` did not validate `y_score ∈ [0,1]`; `fit_temperature` did not validate `bounds`. | v0.51 Tier-2 additive validation matching `metrics.py`-side ECE rigor. | `4c5e140` |
| R8-C7 | fix-recommended | Tests covered counts where semantics matter: `test_harness_folded.py:92-107` no partition-content check; `test_harness_parallelism.py:77-122` no Generator-rng; `test_protocol_conformance.py:228-240` no k>n_sources. | v0.51 adds semantic-property regression tests bundled with C1/C2/C4 fixes. | `87453f6` + `61964f6` + `f60d43c` |
| R8-C8 | minor | `SimilarityStrategy` listed as Tier-2 in README + `extending.md` but absent from `strict_tier2_protocols.md`'s 9-strict list. | v0.51 demotes SimilarityStrategy to "pre-v0.7 internal interface" in README + extending.md (aligns to v1.0 contract per ADR 0003). | `c206b54` |
| R8-C9 | minor | `GateResult.to_dict()` returned numpy + NaN unchanged; `json.dumps(..., allow_nan=False)` raised TypeError on the result. | v0.51 docstring documents the JSON-safety contract: strict-JSON requires `artifacts.write_json_strict` or `sanitize_for_json`. | `c206b54` |
| R8-C10 | minor | `.gitignore` patterns covered only `gate3-audit-*-report.md`; new `codex-comprehensive-audit-*-report.md` + `audit-verification-*.md` were tracked by default. | v0.51 `.gitignore` extended with new patterns. | `c206b54` |
| R8-F1 | fix-recommended (Claude; missed-by-both) | `losses.RecallAtLowFPR.__init__` did not validate `pos_weight > 0`; non-positive values produced degenerate-but-bounded loss values silently. | v0.51 eager validation matching sibling-kwarg pattern. | `4c5e140` |
| R8-F2 | minor (Claude; missed-by-both) | `metric_specs.ece(n_bins=)` validated `strategy` eagerly but deferred `n_bins` to compute time. | v0.51 eager `_validate_n_bins` call at factory level. | `4c5e140` |
| R8-F3 | minor (Claude; missed-by-both) | `analysis.CsvPredictionReader.read_predictions` silently filled missing CSV columns with empty strings → cryptic dtype error downstream. | v0.51 detects missing columns at read time; raises actionable `ValueError`. | `4c5e140` |

### Refuted (3): recorded for audit trail; NO fix shipped

| ID | Severity (as auditor marked it) | Finding (rejected) | Rationale |
|----|---------|---------|-----------|
| R8-G2 | fix-recommended (Gemini) | "Cyclic import leakage→harness identified but left unresolved." | REFUTED: `harness.py:709-712` resolves the cycle via TYPE_CHECKING + lazy `noqa: PLC0415` import; comment explicitly states the resolution. Cycle is deliberately broken by an asymmetric-by-design pattern, not unresolved. |
| R8-G5 | minor (Gemini) | "Plotting tests use weak assertions (`assert fig is not None`)." | REFUTED: `tests/test_plotting_edge.py:125-140` does have that weak assertion, but the same file has structural assertions (line 92 `assert out is fig`; lines 220, 230 `.get_title()` equality). Cherry-picked; not representative. |
| R8-V1 | "Style + consistency: Exceptional" (Gemini) | Validation that the axis is exceptional. | REFUTED: Gemini's specific tooling claim (Ruff + Black + Mypy strict compliance) is true, but the axis is broader. Codex C1 + C5 + C6 + C7 + C8 + R8-G3 confirm substantive cross-API / docs / contract consistency drifts. Tooling-clean ≠ axis-exceptional. |
| R8-V2 | "Domain methodology: Masterclass" (Gemini) | Validation that methodology is masterclass. | REFUTED: Codex C1 + C3 + C4 + C6 confirm methodology bugs at fix-recommended-or-blocker severity. Methodology has real flaws; "Masterclass" overstates. Gemini's evidence (research-grounded tests exist) conflates research-citation discipline with methodology correctness. |
| R8-V3 | "scorer_error catches predictions safely" (Gemini) | Validation that the mechanism is safe. | CONFIRMED, recorded for completeness: `harness.py:468-474` correctly re-raises `MemoryError` + `AssertionError` BEFORE the broad `except Exception`, so critical exceptions escape the silent-capture path. |

### Deferred to v1.x (2)

| ID | Severity | Finding | Disposition |
|----|----------|---------|-------------|
| R8-G3 | fix-recommended (Gemini) | Codebase uses `ValueError` for every error condition (295 raises in src/; 0 custom exception classes). | Deferred to v1.x as Tier-2 additive. Custom exception hierarchy subclassing ValueError preserves catch-compatibility; adding it post-v1.0 is non-breaking. Reconsider when a downstream consumer requests it. |
| R8-G4 | minor (Gemini) | joblib OOM hazard documented at `_parallel.py:55-59` but not structurally mitigated (no memory-aware `n_jobs` capping in `harness.evaluate`). | Deferred to v1.x. Memory-aware capping needs RAM-measurement + dataframe-size accounting; non-trivial, no clear best-practice. Caller is responsible for `n_jobs` sizing under their RAM budget at v1.0. |

### Round 8 ship status

- **13 confirmed findings**: all RESOLVED in v0.51.0 via the per-finding commits above (Phase 1 + 2 + 3 + 4 on `release/v0.51.0`).
- **3 refuted findings** (R8-G2 cyclic-import framing, R8-G5 cherry-picked weak test, R8-V1+R8-V2 over-confident Gemini validations paired-as-one): recorded above; no shipped fix needed.
- **2 deferred findings** (R8-G3 custom exceptions, R8-G4 joblib OOM capping): Tier-2 additive; v1.x or later.
- **Tally**: 13 + 3 + 2 = 18 total (canonical; aligned with CHANGELOG.md v0.51 section + migration/v0.51.md per RC4 reconciliation).
- **Round 8 STOP-GATE status**: CLOSED via v0.51.0 ship. **Round 9 audit STOP-GATE** per Decision Y.2 opens against the v0.51.0 RC before `v1.0.0` tag can land.

---

## Round 9 (2026-05-24): STOP-GATE before v1.0 tag

Round 9 multi-LLM cross-review fired against the `release/v0.51.0` RC (7 commits at the time of audit; PR #75 draft, CI green). **Codex** returned 4 substantive findings (RC1-RC4) + self-validation worklog (69/69 v0.51 regression tests pass); explicitly skipped the 10 modules neither Round-8 auditor cited + 4 cross-cutting hunts. **Gemini** returned 6 items (RG1-RG6): 3 design challenges (2 escalating already-locked v0.51 decisions to v1.0 blockers) + 3 positive validations; zero probes, zero line citations, ~45% validation density.
Independent verification by Claude (`audit-verification-round-9-v0.51.0.md`, 228 lines; gitignored per R8-C10) confirmed **6 / refuted 3 / partial 1** of the 10 source-report items, AND surfaced **3 third-audit findings + 3 minors** by hunting the modules Codex skipped. Of those, F-sweep-1 was a CANDIDATE v1.0 BLOCKER (R7-C "no silent failures" invariant violation on NaN scorer output) that neither Round-8 nor Round-9 auditor caught. Per the Path 3 lock from the post-R9 planning session, **two third-audit findings (F-sweep-1 + F-bootstrap-1) ship fixed in this PR**; RC1 already fix-in-PR at `4c43771`; the remaining items defer to v1.0.1.

### Confirmed (6): disposition mapped

| ID | Severity | Finding | Disposition | Commit / Defer-to |
|----|----------|---------|-------------|--------------------|
| R9-RC1 | fix-recommended (Codex) | README.md:208 has 2 broken `docs/extending.md` + `docs/examples/claims_and_gates.md` hyperlinks; R8-C5 regression (only the API-surface table-cell links; other R8-C5 sites are correct). | Fixed in PR. | `4c43771` |
| R9-RC2 | fix-recommended (Codex) | `SimilarityStrategy` contract mismatch: `__init__.py:294` exports it + `snapshot.json:1103-1117` pins `protocol_methods` (de facto strict Tier-2 contract) BUT `extending.md:18` + `strict_tier2_protocols.md` 9-strict list omit it. Snapshot IS the v1.0 contract regardless of docs. | Deferred to v1.0.1. Two paths: add to strict-Tier-2 list (10 strict + 1 opt-in; ADR 0003 update) OR remove from `__all__` + snapshot. | v1.0.1 |
| R9-RC3 | fix-recommended (Codex) | `test_evaluate_folded_reseed_splitter_varies_partitions` at `tests/test_harness_folded.py:134-164` only asserts key existence + count; no row-content comparison across seeds. Docstring claims partitions differ but assertions don't check it. Directly REFUTES Gemini RG6. | Deferred to v1.0.1 test hardening. | v1.0.1 |
| R9-RC4 | minor (Codex) | "3 refuted" counts categories (V1+V2 paired as one) while 4 named items listed; defensible across all 4 surfaces but invites confusion. | Deferred to v1.0.1 docs polish. | v1.0.1 |
| R9-RG4 | validation (Gemini) | metrics.py `SINGLE_CLASS_INCOMPATIBLE_METRICS` pattern is "elegant". | Validation accurate; pattern at metrics.py:35-99 is well-designed. | - |
| R9-RG5 | validation (Gemini) | bootstrap.py defaults match "industry standards". | Validation accurate: `bootstrap.py:66-69` shows N=1000, α=0.95, BCa, seed=42; all match common conventions. Terminology nit: Gemini used outdated "seed=" instead of v0.50 "rng=". | - |

### Refuted (3): recorded for audit trail; NO fix shipped

| ID | Severity (as auditor marked it) | Finding (rejected) | Rationale |
|----|---------|---------|-----------|
| R9-RG2 | v1.0 blocker (Gemini) | "DeprecationWarning without sunset date is structurally wrong; v1.0 blocker." | REFUTED: re-litigates Q3-locked v0.51 design. `harness.py:1413-1414` shows the deliberate implementation; lines 1347-1349 + 1368 document "warning persists past v1.0" as chosen contract. Tested by `test_evaluate_folded_multi_seed_without_reseed_emits_deprecation_warning`. Meta-pattern: Gemini escalates without engaging with locked reasoning (same as R8 V1+V2). |
| R9-RG3 | v1.0 blocker (Gemini) | "iter_folds capping silently is wrong; should raise." | REFUTED: re-litigates R8-C2 design lock. `splits.py:325-339` shows the deliberate cap-with-UserWarning implementation; warn-vs-raise IS a judgment call; the v0.51 decision was permissive-by-design (caller can opt into `warnings.filterwarnings('error', UserWarning)` for strictness). |
| R9-RG6 | validation (Gemini) | "Regression tests apply strict invariant assertions, not weak proxies." | REFUTED: directly contradicted by RC3 which I independently confirmed. `tests/test_harness_folded.py:134-164` is precisely the weak-proxy pattern Gemini claims doesn't exist. Gemini didn't read the cited tests; textbook over-confident validation. |

### Partial (1)

| ID | Severity | Finding | Disposition |
|----|----------|---------|-------------|
| R9-RG1 | accept-with-rationale (Gemini) | `np.inf` sentinel from `recall_at_fpr` risks silent downstream failure. | PARTIAL: internally NO callers (grep across `src/` returned only function def + docstring mentions + an unrelated local var in `losses.py:156`). Sentinel risk is real only at user-code boundary; mitigated by Q1-locked design + docstring's "caller may filter via np.isinf" guidance. Accept-with-rationale at v1.0. |

### Third-audit findings (Claude; modules neither auditor cited): 3 substantive + 3 minor

| ID | Severity | Finding | Disposition | Commit / Defer-to |
|----|----------|---------|-------------|--------------------|
| R9-F-sweep-1 | **blocker for v1.0** (Claude escalation) | `_sweep.py:_validate_scorer_output()` accepts NaN/inf scorer outputs without validation; R7-C "no silent failures" invariant violation in a module Codex R7 designed but didn't audit for finiteness. `stacking.py` validates non-finite scores; `_sweep.py` didn't. | Fixed in PR. NaN/inf check added to R7-C boundary; test bundled. | New commit on `release/v0.51.0` (this PR) |
| R9-F-bootstrap-1 | fix-recommended | scipy's BCa returns degenerate CIs (`ci_low == ci_high == point` or NaN bounds) on small n + ceiling/floor metrics; scipy emits DegenerateDataWarning but doesn't raise; pre-v0.51 R8-C4(b) RNG bug may have masked this. | Fixed in PR. UserWarning added at the bootstrap_ci boundary when BCa degenerates; test bundled. | New commit on `release/v0.51.0` (this PR) |
| R9-F-bootstrap-2 | minor | `mde_from_ci` width check `if width <= 0` doesn't catch NaN width (NaN <= 0 is False in IEEE float); silent NaN MDE if BCa returns NaN bounds. | Bundled into the F-bootstrap-1 commit. `or not np.isfinite(width)` guard added.
| New commit on `release/v0.51.0` (this PR) | | R9-F-metrics-1 | minor (Claude) | `metrics.py:1305` Brier-decomposition docstring claims "≈ 1e-9" precision but actual error ~3e-3 dominated by binning. Self-contradictory. | Deferred to v1.0.1 docs polish. | v1.0.1 | | R9-F-metrics-3 | fix-recommended (Claude) | `expected_calibration_error(y=[0,0,1,1], y_score=[0.5,0.5,0.5,0.5], n_bins=10)` returns 0.0 — technically correct per formula but semantically misleading (uninformative model looks perfectly calibrated). Equal-mass binning correctly rejects n 13` etc). Validator processed `audit_findings.md` + `migration/v0.51.md` + `CHANGELOG.md` without crashing; surfaced 216 pattern-matches + 82 matches. **The high false-positive count is expected** — the synthetic bindings dict has very loose detector ("Round 8") + metric (numeric-count) patterns, and any number near a "Round 8" mention triggers. **NOT real bugs in eval-toolkit's docs** — just the validator successfully exercising the regex+window pipeline at production-scale markdown. The dogfood script is preserved at `.scratch/dogfood_v1_0_x.py` (gitignored) for future re-runs. ### Round 11 outcome - ✅ Action 1 (consumer adoption): consumer PR #10 merged 2026-05-26 01:48Z. Consumer-side dogfood + adoption complete for `validate_citations`. - ✅ Action 2 (this ledger entry): R11 section recorded. - ✅ Action 3 (dogfood expansion): validators exercised on eval-toolkit's own surface; no upstream-side findings. GH issue #77 closed with state_reason=completed. ### Multi-LLM audit cadence after R11 The Gate 3 audit chain (R5–R10) closed at v1.0 per ADR 0003. R11+ is the post-v1.0 consumer-feedback loop and uses NO multi-LLM cross-review by default. Multi-LLM dispatch resumes only if: 1. A v2.0 design cycle opens (would require a Gate 3-equivalent re-audit per ADR 0003 §"v2.0 re-audit triggers"). 2. 
A consumer files a major-severity finding that single-LLM verification (Claude only) leaves uncertain — in which case Codex + Gemini dispatch via the `audit-prompt.md` template re-engages for that specific question. Otherwise R11+ is light-touch documentation of the consumer-feedback → upstream-validator → consumer-adopt cycle. --- ## Round 12 (2026-05-26) — schema-extensibility + scope-correctness lesson from #80 **Not a multi-LLM gate-style audit.** Round 12 is the second consumer-feedback-driven round, immediately following R11. The R11 audit-validator family (3/3 shipped) revealed a deeper structural gap in `audit_value_bindings` that the v1.0.3 implementation could not address: the 2-tuple `(detector, metric)` canonical-identity schema collapsed across slices, producing ~95 false positives on the consumer's writeup. The fix (v1.1.0) is bigger than the issue's literal ask — it surfaces two architectural rules that now govern all audit validators (pending ADR 0005). ### Motivating finding (consumer audit origin) - **Consumer's v1.3.9 run** (2026-05-26, prompt-injection-detection-submission): `scripts/audit_value_bindings.py` against own writeup produced **96 warnings, ~95 false positives**. The one real bug (`WRITEUP_PAPER.md:545` mis-citing 0.971 inside a misleading paragraph context) was correctly flagged but hid in the noise. Consumer caught it via parallel manual audit, **not** via this validator. Filed [#80](https://github.com/brandon-behring/eval-toolkit/issues/80) at 2026-05-26T18:15:45Z proposing a 3-tuple `(detector, metric, slice)` schema extension. ### Upstream response timeline | Time (UTC) | Event | |---|---| | 2026-05-26T18:15Z | Consumer files #80 (BINDINGS slice-axis schema) | | 2026-05-26T18:30Z | `/exploring-options` Round 1: P0 path for v1.0.4 PyPI gap. Decision: refined bundle (rerun + workflow_dispatch + verify-receipt). | | 2026-05-26T19:00Z | `/exploring-options` Round 2: #80 migration shape. 
Decision: **Option 4 — `BindingKey` structured key + multi-shape adapter** (rejected Option 1 `(d, m, s)` in-place tuple because positional tuples lock in the recur-every-N-months schema-event pattern). | | 2026-05-26T19:24Z | **v1.0.5 ships** — infrastructure-only release (publish workflow hardening; dress rehearsal for the new Verify PyPI receipt step). | | 2026-05-26T19:30Z | `/exploring-options` Round 3: Step 1 release strategy + verify specs. Decision: v1.0.5 as dress rehearsal; conservative 12×30s PyPI poll. | | 2026-05-26T19:45Z | `/exploring-options` Round 4: ADR 0005 scope + deprecation policy. Decision: medium scope (audit validators only); indefinite acceptance of legacy 2-tuple BINDINGS through v1.x. | | 2026-05-26T20:00Z | First dogfood of v1.1.0 slice-axis fix: only 22% noise reduction (95 → 74). User pushed back: "what is the right long-term solution?" Triggered the deeper rethink. | | 2026-05-26T20:10Z | Identified the **second architectural layer**: scope correctness. ~80% of v1.0.5 residual noise was content-type confusion (CI brackets, table cells, code blocks), not slice-axis confusion. Added `scope='narrative'` filter. | | 2026-05-26T20:11Z | **v1.1.0 ships** — `BindingKey` + slice-aware matching + `scope='narrative'` (closes #80). | End-to-end cycle: #80 filing (18:15Z) → v1.1.0 ship (20:11Z) = **~2 hours compressed cycle**. Four `/exploring-options` rounds + one user-driven architectural redirect ("ultrathink on what the right long-term solution is") produced a coherent two-layer fix instead of the issue's narrower literal ask. ### The two-layer correctness lesson (pending ADR 0005) Audit validators have two correctness layers, not one: 1. **Identity correctness** — canonical measurements have *structured identity* (frozen dataclass with named fields), not positional tuples. Future identity axes (split, ci_kind, source_ref, ...) added as defaulted fields without breaking the dict-key schema. 
Positional tuples lock the validator into a recur-every-N-months schema-event pattern. 2. **Scope correctness** — the validator should only scan content *plausibly a binding claim*. Narrative prose sentences are. Markdown table cells aren't (they're structured data, audited differently). Bracketed expressions (`[CI 0.286, 0.301]`) aren't (inline expressions, not point estimates). Code blocks aren't. This mirrors lint-design conventions from `ruff`/`mypy`/`bandit` (scope predicates like `# noqa`/`# nosec`/`# type: ignore` are first-class, not optional). ### Dogfood evidence | Configuration | Warnings on consumer HEAD | Reduction vs v1.0.5 baseline | |---|---|---| | v1.0.5 (legacy 2-tuple) | 95 | — (baseline) | | v1.1.0 BindingKey + slice-fix only (scope='all', default `slice_window_chars=120`) | 74 | -22% | | v1.1.0 BindingKey + slice-fix + `scope='narrative'` (recommended) | **23** | **-76%** | The first instinct — "just add a slice axis like the issue asks" — would have left the validator practically unusable on dense academic prose, requiring a separate v1.2.0 cleanup release. The deeper rethink (triggered by the user's redirect) delivered both layers in a single coherent v1.1.0. Residual 23 warnings are pre-existing positional-heuristic limitations (sentence-boundary unawareness, multi-detector list parsing on dense prose) not addressable without parser-level work; deferred to future v1.2.0+ as a narrower issue class. ### Round 12 outcome - ✅ **v1.1.0 shipped** (commit `911565b`, PyPI live). - ✅ **#80 auto-closed** by `closes #80` directive on the v1.1.0 tag. - ✅ **0 open issues** on the eval-toolkit repo post-#80. - ✅ **Architecture lesson codified** in pending [ADR 0005](adr/0005-structured-keys-for-audit-validators.md): the identity + scope two-layer rule applies to future audit validators. 
- ✅ **Memory entry**: `feedback_validator_identity_plus_scope` captures the "issue body says X is the problem; investigate the dominant noise source before agreeing" discipline. ### Consumer adoption + HARD-gate viability Consumer (`prompt-injection-detection-submission`) currently pins `eval-toolkit>=1.0.3,<2`. Adoption path for v1.1.0: - Smallest diff: replace 2-tuple `BINDINGS` literal with 3-tuple keys (issue body's proposal; works directly). - Recommended: migrate to `BindingKey(detector=..., metric=..., slice=...)` for forward-extensibility. - Add `scope="narrative"` to the validator call. - Re-run; expect ~76% noise reduction. HARD-gate promotion at consumer's v1.3.10+ becomes credible at the ~80% reduction level. Remaining residual false positives can be suppressed via consumer-side filtering (excluding lines containing "random floor" or "versus") or accepted as known low-frequency noise pending v1.2.0+ pairing-rule improvements. ### Multi-LLM audit cadence after R12 Same as R11: no multi-LLM cross-review for consumer-feedback rounds; reserved for v2.0 design cycles or major-severity findings that single-LLM verification cannot resolve. The R12 fix was informed by the user-driven `/exploring-options` discipline (4 rounds + a structural redirect), which substitutes for multi-LLM dispatch for this class of architectural decision. --- ## Round 13 (2026-05-26) — `audit_value_bindings` v1.2.0 context-aware noise reduction **Not a multi-LLM gate-style audit.** Third consumer-feedback-driven round, immediately following R12. R12 closed the slice-axis schema-extensibility gap (BindingKey + scope='narrative' content-type filter) and reduced the consumer's noise floor from 96 → 36 warnings (62%). R13 addresses the residual 36 — the positional-heuristic limitations ADR 0005 named as "Future work (deferred)" — via four context-aware extensions to `scope='narrative'` shipped as v1.2.0. 
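For orientation, the Round 12 structured-key idea that the v1.2.0 filters build on can be sketched as follows. This is a minimal illustration under stated assumptions, not the shipped code: the `BindingKey` field names come from the adoption notes above, while `normalize_key`, the detector names, and the numeric values are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class BindingKey:
    """Local stand-in for the structured key; field names follow the ledger."""

    detector: str
    metric: str
    slice: Optional[str] = None  # defaulted, so 2-field keys remain valid


LegacyKey = Union[Tuple[str, str], Tuple[str, str, str]]


def normalize_key(key: Union[BindingKey, LegacyKey]) -> BindingKey:
    """Hypothetical adapter: accept a BindingKey, a 2-tuple, or a 3-tuple."""
    if isinstance(key, BindingKey):
        return key
    if isinstance(key, tuple) and len(key) in (2, 3):
        return BindingKey(*key)
    raise TypeError(f"unsupported binding key: {key!r}")


# Placeholder bindings mixing all three accepted shapes (values are invented).
bindings = {
    ("det_a", "auprc"): 0.10,                       # legacy 2-tuple
    ("det_b", "auprc", "ood"): 0.20,                # 3-tuple shape from #80
    BindingKey("det_c", "auroc", slice="ood"): 0.30,
}
normalized = {normalize_key(k): v for k, v in bindings.items()}
```

Because the dataclass is frozen, instances are hashable and can serve as dict keys. A later identity axis would be added as another defaulted field, which leaves existing 2-tuple and 3-tuple inputs working.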
### Motivating evidence (consumer-side, post-v1.3.11) Consumer (`prompt-injection-detection-submission@v1.3.11`) ran the v1.1.0 `audit_value_bindings` against their writeup and produced 36 residual false positives. Consumer-side categorization in `decisions/upstream_issues.md:92`: | Category | Count | Discriminating prose pattern | |---|---|---| | Random-floor / sub-clause mentions | ~10 | "random floor of 0.374", "baseline of 0.374" | | Delta with sign / keyword | ~9 | "-0.071 AUPRC", "drops -0.132 below" | | Multi-detector "vs"/"versus" enumeration | ~8 | "AUPRC 0.556 vs 0.519" (same binding double-flagged) | | Sentence-boundary cross-pairing | ~5 | "X scored 0.291. The floor is 0.374" | | Genuinely ambiguous (cross-detector list constructions) | ~4 | "0.293 versus 0.364 for the frozen probe and 0.291 for TF-IDF" | The first four categories share a structural property: they're narrative-prose context cues that the v1.1.0 positional heuristic can't read. ADR 0005's "Future work (deferred)" section named them explicitly. The fifth category (cross-detector list grammar) is a deeper parser-level problem. ### Upstream design — `/exploring-options` 4 rounds + ultrathink redirect | Time (UTC) | Event | |---|---| | ~20:30Z | `/exploring-options` Round 1: which categories to address in v1.2.0? Decision: **Full T1+T2+T3+T4** over the Explore agent's recommended T1+T2-only. T3 and T4 are cheap (~70 LOC combined) and address ADR 0005-named deferred work. | | ~20:45Z | `/exploring-options` Round 2: API placement. Decision: **all four filters bundled under `scope='narrative'`**; backward compat for `scope='all'` preserved exactly. | | ~21:00Z | `/exploring-options` Round 3: sentence-boundary detection rigor. Decision: **paragraph-aware abbreviation guard** (vs./e.g./i.e./etc./cf./fig./eq./pp./viz./ca. excluded; single `\n` soft, `\n\n` hard; decimal + letter-dot-letter patterns guarded). 
Rejected naive `α`/`β` (over-splits multi-line academic prose); rejected `γ` library-based (nltk/spacy too heavy). | | ~21:10Z | `/exploring-options` Round 4: keyword list configurability. Decision: **hardcoded module-level `frozenset` constants** (`_DELTA_KEYWORDS`, `_FLOOR_KEYWORDS`). No new public kwargs. | | ~21:20Z | **v1.2.0 ships** — commit `40b3741`, PyPI live, GH release published, `Verify PyPI receipt` step ✓. End-to-end from R12 closure to v1.2.0 ship: ~1 hour. | ### Dogfood evidence (compounded across the cycle) | Release | Configuration | Warnings on consumer HEAD | Reduction vs v1.0.5 | |---|---|---|---| | v1.0.5 | Legacy 2-tuple, no scope filter | 95 | — (baseline) | | v1.1.0 | `BindingKey` + `scope='narrative'` content-type filter | 23 | -76% | | **v1.2.0** | + T1–T4 context filters | **7** | **-93%** | The 7 v1.2.0 residuals are all in the "cross-detector list construction" category — prose where the validator can't infer that subsequent values belong to OTHER detectors via list connectives ("and", "for X", "vs"). T3 only deduplicates the SAME binding within one sentence; cross-detector list inference is a parser-level problem. ### Tactical refinements during implementation Three calibration adjustments emerged from running the test suite + dogfood: 1. **`_FLOOR_KEYWORDS` narrowed** — initial draft included "baseline", "prior", "majority"; removed after the existing `test_detector_with_no_nearby_value_skipped` failed on prose `"The TF-IDF baseline performs well"` (where "baseline" was a detector descriptor, not a floor reference). Replacement coverage for the legitimate floor cases is provided by T1's "below" keyword (e.g., "below the prevalence baseline of 0.374" is caught by "below"). 2. **T1 keyword window changed to before-only** — initial draft used symmetric ±30 chars; the combined-dogfood test failed on `"frozen probe's 0.515 (delta -0.132)"` because "delta" (AFTER 0.515) suppressed 0.515. 
Before-only window correctly fires T1 on `-0.132` (preceded by "delta") and leaves 0.515 alone. 3. **`_DELTA_KEYWORDS` calibrated** — dropped "against" (suppressed legitimate "LoRA's AUROC is 0.383 against frozen probe's 0.515" matches). Kept "vs"/"versus" (canonical delta separator). Excluded "above"/"ahead"/"behind" (too ambiguous). These calibrations are captured in the keyword-list comments in `audit_value_bindings.py` so future maintainers see the rationale. ### Round 13 outcome - ✅ **v1.2.0 shipped** (commit `40b3741`, PyPI live). - ✅ **ADR 0005 amended**: "Future work (deferred)" section notes the v1.2.0 partial closure; sentence-boundary + 3 related context-aware filters resolved; cross-detector list grammar remains v1.3.0+ territory. - ✅ **Consumer adoption path clear**: re-pin `eval-toolkit>=1.2.0,<2`; HARD-gate promotion of `audit_value_bindings` becomes credible (7 residual < 10 actionable threshold). - ✅ **Memory entry updated**: `feedback_validator_identity_plus_scope` extended with the v1.2.0 sequel — the same "identity + scope" framing accommodates the new context-aware filters as scope extensions, vindicating the original two-layer architecture decision. ### Multi-LLM audit cadence after R13 Unchanged from R11/R12: no multi-LLM cross-review for consumer-feedback rounds. The R13 design was informed by the user-driven `/exploring-options` discipline (4 rounds + tactical calibration during implementation), substituting for multi-LLM dispatch for this class of incremental refinement. Future Round 14+ would re-engage multi-LLM dispatch only if a v2.0 design cycle opens or a major-severity consumer finding requires it. ### Cross-references - ADR 0005 — "Future work (deferred)" section now reflects v1.2.0 closure status. - v1.2.0 CHANGELOG entry — full per-filter detail (T1–T4 keyword lists, window calibration, dogfood numbers). - `tests/test_audit_value_bindings.py` — 36 tests (28 from v1.1.0 + 8 new for T1–T4 + sentence-boundary unit test). 
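The before-only T1 window from the tactical refinements above can be sketched as follows. This is an illustration only: the keyword set is an assumed subset, the 30-character window reuses the size quoted for the initial symmetric draft, and the helper names are hypothetical. The shipped constants in `audit_value_bindings.py` may differ.

```python
import re
from typing import List

# Assumed keyword subset and window size (not the shipped constants).
DELTA_KEYWORDS = frozenset({"delta", "below", "vs", "versus"})
WINDOW_CHARS = 30

VALUE_RE = re.compile(r"-?\d+\.\d+")
WORD_RE = re.compile(r"[a-z]+")


def is_delta_value(text: str, start: int) -> bool:
    """True when a delta keyword occurs in the window *before* the value."""
    window = text[max(0, start - WINDOW_CHARS):start].lower()
    return any(word in DELTA_KEYWORDS for word in WORD_RE.findall(window))


def candidate_values(text: str) -> List[str]:
    """Numeric literals that survive the before-only delta filter."""
    return [
        match.group()
        for match in VALUE_RE.finditer(text)
        if not is_delta_value(text, match.start())
    ]


print(candidate_values("frozen probe's 0.515 (delta -0.132)"))
```

On the ledger's own example, `0.515` survives because "delta" appears only after it, while `-0.132` is filtered because "delta" precedes it. A symmetric window would have suppressed both.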
--- ## Round 14 (2026-05-26) — Layer 3 pairing rules close cross-detector list-grammar (#81) **Not a multi-LLM gate-style audit.** Fourth consumer-feedback-driven round in the post-v1.0 cycle (R11 → R12 → R13 → R14). The v1.2.0 cycle (Round 13) reduced consumer noise from 96 → 4 warnings via the T1–T4 context-aware narrative filters; the residual 4 were all cross-detector list-grammar / metric-axis confusion cases that the two-layer identity-+-scope model couldn't resolve. Consumer filed [#81](https://github.com/brandon-behring/eval-toolkit/issues/81) documenting the 3 prose patterns + proposing a `_LIST_CONNECTIVES` extension; upstream v1.3.0 ships **Layer 3 (pairing rules)** per [ADR 0006](adr/0006-pairing-rules-for-cross-detector-list-grammar.md) as the architectural closure. ### Motivating residuals from v1.2.0 dogfood Consumer's v1.3.12 dogfood (eval-toolkit v1.2.0 adopted; 36 warnings → narrowed by their additional SKIP_PATTERNS to 4): | File:Line | Pattern category | Prose snippet | |---|---|---| | `WRITEUP_PAPER.md:304` | "for X" postfix (Pattern A) | `"versus 0.364 [...] for the frozen probe and 0.291 [...] for TF-IDF + LR"` | | `RESULTS.md:171` (×2) | possessive + metric confusion (Patterns B + D) | `"LoRA's pooled OOD AUROC is 0.383 against frozen probe's 0.515"` | | `README.md:71` | group subject (Pattern C) | `"0.38 AUROC, ~0.6 drop for the trained detectors"` | ### Upstream design — `/exploring-options` 2 rounds + Pattern D emerged during dogfood | Time (UTC) | Event | |---|---| | ~21:30Z | Consumer files #81 documenting the 3 prose patterns (A/B/C) + proposing `_LIST_CONNECTIVES` design. | | ~22:00Z | `/exploring-options` Round 1: scope. Decision: A + B + C-suppress (not C-inference) — closes all 4 residuals at ~130 LOC; rejected A+B-only (leaves 1) and A+B+C-inference (~250 LOC, MODERATE-HIGH risk). | | ~22:10Z | `/exploring-options` Round 2: documentation. 
Decision: new ADR 0006 codifying Layer 3 (pairing rules) as the third correctness layer alongside ADR 0005's identity + scope. | | ~22:30Z | Implementation reaches first dogfood: 4 → 2 warnings. **Pattern D emerges**: the residual 2 are metric-axis confusion (`"AUPRC delta suggests: AUROC is 0.383"` — wrong metric picked up by proximity). Added as fourth rule (symmetric to detector-axis pairing). | | ~22:45Z | Three tactical calibrations: (a) Pattern A intervening-value check now uses v1.1.0's `excluded_ranges` (CI brackets don't count as intervening); (b) Pattern B refactored to "last possessive within 30 chars" instead of requiring an "is" verb; (c) Pattern C uses v1.2.0's `_crosses_sentence_boundary` to prevent paragraph-crossing suppression bleed. | | ~22:55Z | Override-bypass architecture: Pattern A/B now set `pairing_confirmed_pos` to BYPASS the proximity-based detector pairing when they confirm THIS binding's detector (fixed the bug where override + proximity disagreed). | | ~23:00Z | **v1.3.0 ships** — commit `4c2742e`. Consumer dogfood: **4 → 0 warnings**. Combined v1.0.5 → v1.3.0: 95 → 0 (-100%). | End-to-end from #81 filing to v1.3.0 ship: ~1.5 hours. ### Layer 3 architecture (per ADR 0006) Audit validators now have a three-layer correctness model: | Layer | Dimension | Mechanism | Release | |---|---|---|---| | 1 | Identity | `BindingKey` structured keys | v1.1.0 | | 2 | Scope | Content-type + context-keyword filters (`scope='narrative'`) | v1.1.0 + v1.2.0 | | **3** | **Pairing** | **Override / suppress proximity-based pairing under grammar cues** | **v1.3.0** | Four Layer 3 rules ship in v1.3.0 (all under `scope='narrative'`, no new public kwargs): - **Pattern A**: `"for {detector}"` postfix → override detector pairing (with intervening-value guard via v1.1.0 exclusion-ranges). - **Pattern B**: `"{detector}'s"` possessive → override detector pairing (last possessive within 30 chars of value). 
- **Pattern C**: `"for the {trained|frozen|baseline|all|both|other} detectors"` group subject → suppress candidate (with sentence-boundary guard via v1.2.0 sentence-positions). - **Pattern D**: metric-axis nearest-pairing (symmetric to detector-axis). Pre-collects ALL metric positions across consumer-supplied `metric_aliases` keys, not just binding-derived metrics. ### Dogfood evidence (compounded across the four rounds) | Release | Configuration | Warnings on consumer HEAD | Reduction vs v1.0.5 | |---|---|---|---| | v1.0.5 | Legacy 2-tuple, no scope filter | 95 | — (baseline) | | v1.1.0 | BindingKey + scope='narrative' content-type | 23 | -76% | | v1.2.0 | + T1–T4 context filters | 7 | -93% | | **v1.3.0** | + Patterns A/B/C/D pairing rules | **0** | **-100%** | ### Round 14 outcome - ✅ **v1.3.0 shipped** (commit `4c2742e`, PyPI live). - ✅ **#81 auto-closed** by `closes #81` directive on the v1.3.0 tag. - ✅ **0 open issues** on the eval-toolkit repo post-#81. - ✅ **ADR 0006 committed** — Layer 3 pairing rules formally codified as the third correctness layer. - ✅ **Consumer HARD-gate promotion now credible.** With 0 residual warnings, the consumer can promote `audit_value_bindings` from SOFT to HARD bundled with `audit_citation_alignment` per their v1.3.8 plan. 
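The Pattern B rule listed above (last possessive within 30 chars of the value) might look roughly like the sketch below. The alias map, function name, and return convention are assumptions made for illustration; only the 30-character window and the example sentence come from this ledger.

```python
import re
from typing import Optional

# Hypothetical alias map; a real consumer would supply its own detector names.
DETECTOR_ALIASES = {"lora": "LoRA", "frozen probe": "frozen probe"}
POSSESSIVE_WINDOW = 30  # "within 30 chars of value", per the Pattern B rule

POSSESSIVE_RE = re.compile(
    "(" + "|".join(re.escape(alias) for alias in DETECTOR_ALIASES) + ")'s",
    re.IGNORECASE,
)


def possessive_detector(text: str, value_start: int) -> Optional[str]:
    """Return the detector named by the last possessive before the value."""
    window_start = max(0, value_start - POSSESSIVE_WINDOW)
    matches = list(POSSESSIVE_RE.finditer(text, window_start, value_start))
    if not matches:
        return None  # no grammar cue: fall back to proximity pairing
    return DETECTOR_ALIASES[matches[-1].group(1).lower()]


sentence = "LoRA's pooled OOD AUROC is 0.383 against frozen probe's 0.515"
for value in ("0.383", "0.515"):
    print(value, "->", possessive_detector(sentence, sentence.index(value)))
```

Taking the last possessive inside the window is what lets `0.515` pair with "frozen probe" even though "LoRA" is also named earlier in the same sentence.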
### The post-v1.0 consumer-feedback cycle (R11 → R14)

The four post-v1.0 rounds form a complete consumer-feedback loop:

| Round | Driver | Cycle time | Closure |
|---|---|---|---|
| R11 | Consumer adopts v1.0.x audit-validator family | days | v1.0.4 (3 validators shipped) |
| R12 | Consumer files #80 (BINDINGS slice-axis) | ~2 hours | v1.1.0 (`BindingKey` + `scope='narrative'`; 95 → 23) |
| R13 | v1.1.0 dogfood surfaces context-filter gaps | ~1 hour | v1.2.0 (T1–T4 context filters; 23 → 7) |
| R14 | Consumer files #81 (cross-detector list-grammar) | ~1.5 hours | v1.3.0 (Layer 3 pairing rules; 4 → 0) |

The pattern is consistent: consumer surfaces a structural gap → upstream files an ADR-driven design → ships within hours → consumer adopts same-day. The 100% noise reduction over four rounds validates the iterative consumer-feedback model as a substitute for heavyweight multi-LLM gate audits (which closed at v1.0 per ADR 0003).

### Multi-LLM audit cadence after R14

Unchanged from R11/R12/R13: no multi-LLM cross-review for consumer-feedback rounds. R14's design was informed by `/exploring-options` (2 rounds) + Pattern D emergence during dogfood + 3 tactical calibrations during implementation. Multi-LLM dispatch remains reserved for v2.0 design cycles or major-severity findings.

### Cross-references

- [ADR 0006](adr/0006-pairing-rules-for-cross-detector-list-grammar.md) — Layer 3 codification.
- v1.3.0 CHANGELOG entry — full per-pattern detail with consumer adoption path.
- `tests/test_audit_value_bindings.py` — 43 tests (36 from v1.2.0 + 7 for Patterns A/B/C/D + backward-compat + combined dogfood).

---

## Round 15 (2026-05-26) — audit_citation_alignment Layer 2 + Layer 3 + ADR 0007 family-wide architecture

**Not a multi-LLM gate-style audit.** Fifth consumer-feedback-driven round, opening the equivalent cycle for `audit_citation_alignment` that R11→R14 closed for `audit_value_bindings`.
Consumer filed [eval-toolkit#82](https://github.com/brandon-behring/eval-toolkit/issues/82) concurrent with their v1.3.13 adoption: 188 residual warnings on `audit_citation_alignment`, the same architectural-class gap (Layer 2 + Layer 3 context-awareness) that `audit_value_bindings` worked through across three releases.

### Architectural insight: ADR 0007 generalizes the model

ADR 0005 (Layer 1 + 2) and ADR 0006 (Layer 3) were originally framed as `audit_value_bindings`-specific. Adopting the same architecture for `audit_citation_alignment` in the v1.4.0 cycle revealed that the three-layer model is the canonical architecture for the audit-validator family, not just one validator. ADR 0007 codifies this family-wide.

The v1.4.0 release also extracted shared narrative-prose helpers into private flat module `eval_toolkit/_narrative.py` (consistent with ADR 0001's `_rng.py`/`_parallel.py`/`_sweep.py` precedent). Both validators import from this module; future audit validators inherit the same primitives.

### Three-layer model applied to audit_citation_alignment

| Pattern | Layer | Mechanism |
|---|---|---|
| **β** | Layer 2 | Exclude citations inside markdown table rows, bracketed expressions, fenced code blocks (reuses v1.1.0 `_build_exclusion_ranges`). |
| **γ** | Layer 3 | Category-keyword extraction window bounded by the SENTENCE containing the citation (reuses v1.2.0 `_sentence_boundary_positions`). |
| **α** | Layer 3 | Multi-category set membership: when the sentence matches multiple category keywords (multi-topic prose), accept the citation if the ADR's actual category is in the set. |
| **None-skip** | Layer 2 refinement | Defer (skip the citation) when `subject.category is None` — the consumer's category map can't classify the ADR. Symmetric to the existing `claim_category is None` skip from v1.0.1. |

### Upstream design — `/exploring-options` 3 rounds + Pattern α generalization + None-skip emerged during dogfood

| Time (UTC) | Event |
|---|---|
| ~21:30Z | Consumer files #82 documenting 3 prose patterns (α/β/γ) + 188-warning baseline. |
| ~22:00Z | `/exploring-options` Round 1: scope. Decision: bundle Path A + Path B in v1.4.0 (one coherent release; patterns established). |
| ~22:10Z | `/exploring-options` Round 2: helper sharing. Decision: extract to private flat module `_narrative.py`. |
| ~22:20Z | `/exploring-options` Round 3: documentation. Decision: write ADR 0007 generalizing three-layer model to family. |
| ~22:30Z | Implementation reaches first dogfood: 188 → 101 (46% reduction). Pattern β + γ + initial α (multi-citation only) firing. |
| ~22:40Z | **Symmetric-None skip emerges**: dominant residual is `actual=None` cases (ADRs the consumer's category map can't classify). Symmetric to existing `claim=None` skip from v1.0.1. Added under `scope='narrative'` → 101 → 44. |
| ~22:50Z | **Pattern α generalized**: extend from "multi-citation sentences only" to "any multi-topic sentence" (any sentence matching ≥2 category keywords). 44 → 37. |
| ~22:55Z | Decision: ship at 37 rather than over-engineer. Above the #82 ≤20 target but a 5× reduction; residual 37 includes real misalignments worth consumer-triage + edge cases requiring parser-level understanding. |
| ~23:00Z | **v1.4.0 ships** — commit `7c40490`. ADR 0007 codifies family-wide architecture. |

End-to-end from #82 filing to v1.4.0 ship: ~1.5 hours.

### Dogfood evidence (audit_citation_alignment cycle)

| Release | Configuration | Warnings on consumer HEAD | Reduction vs v1.3.0 baseline |
|---|---|---|---|
| v1.3.0 | `audit_citation_alignment` with `scope='all'` | 188 | — (baseline) |
| **v1.4.0** | **`scope='narrative'`** | **37** | **-80%** |

The residual 37 is above the original #82 acceptance criterion (≤20) but represents a 5× reduction.
Categorized:

- **Real misalignments** consumer should triage (e.g., `ADR-025` cited for a threshold claim when ADR-025 is the cost ADR — could be a wrong-ADR bug or a multi-topic ADR not captured by the consumer's category-keyword map).
- **Single-topic-sentence edge cases**: Pattern α only fires for multi-topic sentences (≥2 category keywords matched). Sentences with one dominant topic + a cross-category ADR cite stay on the legacy first-match check.
- **Multi-topic ADRs** where the consumer's category map assigns a single category but the ADR genuinely covers multiple topics. Consumer-side category-map expansion would address these.

### Round 15 outcome

- ✅ **v1.4.0 shipped** (commit `7c40490`, PyPI live).
- ✅ **#82 auto-closed** by `closes #82` directive on the v1.4.0 tag.
- ✅ **0 open issues** on the eval-toolkit repo post-#82.
- ✅ **ADR 0007 committed** — three-layer architecture (identity + scope + pairing) codified as canonical for ALL audit_* validators. ADRs 0005/0006 now framed as family-wide via 0007.
- ✅ **`_narrative.py` extracted** — shared narrative-prose helpers; both validators import from one canonical location. Signature-preserving refactor; all v1.3.0 tests pass unchanged.
- ⚠ **Consumer HARD-gate promotion**: still a judgment call. The residual 37 includes some real misalignments worth fixing in consumer prose; HARD-gating now would block commits on those lines. Consumer's v1.3.8 bundled-promotion plan should proceed only after triaging the 37.
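The interaction between Pattern α, the symmetric None-skip, and the legacy first-match check can be sketched as below. This is a hedged illustration, not the validator's implementation: the keyword map, the `ADR-031` entry, and the function names are invented, and plain substring matching stands in for whatever keyword extraction the validator actually performs. Only the `ADR-025` cost-versus-threshold case comes from the categorization above.

```python
from typing import Dict, List, Optional

# Hypothetical consumer-side maps (invented for illustration).
CATEGORY_KEYWORDS: Dict[str, List[str]] = {
    "threshold": ["threshold", "operating point"],
    "cost": ["cost", "budget"],
}
ADR_CATEGORY: Dict[str, Optional[str]] = {
    "ADR-025": "cost",
    "ADR-031": None,  # the category map cannot classify this ADR
}


def sentence_categories(sentence: str) -> List[str]:
    """Categories whose keywords occur in the sentence, in map order."""
    lowered = sentence.lower()
    return [
        category
        for category, words in CATEGORY_KEYWORDS.items()
        if any(word in lowered for word in words)
    ]


def check_citation(sentence: str, adr_id: str) -> str:
    """Return 'skip', 'ok', or 'warn' for one citation in one sentence."""
    actual = ADR_CATEGORY.get(adr_id)
    claimed = sentence_categories(sentence)
    if actual is None or not claimed:
        return "skip"  # symmetric None-skip: nothing to compare against
    if len(claimed) >= 2:
        # Pattern α: multi-topic sentence, so set membership is sufficient.
        return "ok" if actual in claimed else "warn"
    # Single-topic sentence: legacy first-match comparison.
    return "ok" if actual == claimed[0] else "warn"


print(check_citation("The threshold follows ADR-025.", "ADR-025"))
print(check_citation("Threshold and cost trade-offs follow ADR-025.", "ADR-025"))
```

In this sketch the single-topic sentence still warns on `ADR-025`, which matches the "real misalignments" category: the set-membership relaxation applies only once a sentence matches at least two category keywords.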
### The post-v1.0 consumer-feedback cycle (R11 → R15)

The five rounds form a complete library-first cycle for the audit-validator family:

| Round | Driver | Cycle time | Closure |
|---|---|---|---|
| R11 | Consumer adopts v1.0.x audit-validator family | days | v1.0.4 (3 validators shipped) |
| R12 | Consumer files #80 (BINDINGS slice-axis) | ~2h | v1.1.0 |
| R13 | v1.1.0 dogfood surfaces context-filter gaps | ~1h | v1.2.0 |
| R14 | Consumer files #81 (audit_value_bindings list-grammar) | ~1.5h | v1.3.0 (Layer 3) |
| **R15** (this) | **Consumer files #82 (audit_citation_alignment scope+pairing)** | ~1.5h | **v1.4.0 (Layer 2 + 3 + ADR 0007)** |

R11→R15 demonstrates the iterative consumer-feedback model as a substitute for heavyweight multi-LLM gate audits (closed at v1.0 per ADR 0003). Each round delivered closure within hours of the consumer filing, with same-day adoption. ADR 0007 captures the architectural template; future audit validators inherit the three-layer model and ship in fewer cycles.

### Multi-LLM audit cadence after R15

Unchanged from R11–R14: no multi-LLM cross-review for consumer-feedback rounds. R15's design was informed by `/exploring-options` (3 rounds) + 2 dogfood-driven refinements (Pattern α generalization + symmetric-None skip). Multi-LLM dispatch remains reserved for v2.0 design cycles or major-severity findings.

### Cross-references

- [ADR 0007](adr/0007-three-layer-architecture-for-audit-validators.md) — family-wide three-layer codification.
- v1.4.0 CHANGELOG entry — full per-pattern detail with consumer adoption path.
- `tests/test_audit_citation_alignment.py` — 18 tests (12 from v1.0.1 + 6 for v1.4.0 Patterns α/β/γ + backward-compat + shared-helpers + combined dogfood).

---