Audit findings ledger

This ledger tracks findings from each cross-model methodology audit (Gate 3 per ADR 0003 — to be drafted at v0.48). Each row records the finding ID, severity, finding summary, disposition (how it was resolved or scheduled), and a link to the tracked issue (where applicable).

Convention: blocker-severity findings get a p1-gate3-labelled GitHub issue for fix-tracking. Lower-severity findings are recorded here only.

Cross-references:

The audit prompt template is at gate3-audit-prompt.md (local — not in published docs).
The v1.0 plan that drives audit cadence is at ~/.claude/plans/evaluate-all-the-work-twinkly-kite.md (local).
Audit re-run schedule: after each breaking minor (v0.46, v0.47, v0.48) plus the original Round 5 pre-implementation pass. 7-day audit-completion timeout per gate.

Round 5 (2026-05-21) — Codex + Gemini pre-implementation audit

Reviewers: author (manual) + Codex (independent report) + Gemini (independent report).

Packet: v0.44.0 code state + the v1.0 release plan (~/.claude/plans/evaluate-all-the-work-twinkly-kite.md) + docs/source/methodology/ (16 chapters) + docs/source/roadmap.md + CHANGELOG.md + src/eval_toolkit/*.py + existing migration guides (migration/v0.7.md, v0.8.md, v0.9.md).

Audit prompt: ~/.claude/plans/gate3-audit-prompt.md.

ID	Severity	Finding	Disposition	Issue
R5-F1	blocker before v0.46	`scorecard()` first-party metric list mixes threshold-free + threshold-dependent metrics; `MetricSpec.compute(y, s) -> float` has no threshold input	Decision R: drop F1/accuracy/precision/recall from v0.46 specs; keep `metrics_at_threshold` + `ThresholdSelector` as separate path	(in plan)
R5-F2	blocker before v0.46	`Scorecard` result type has no contract for undefined/skipped/error cells; existing `MetricState` infra unused	Decision S: reuse `MetricState` (`ok/skipped/error`) vocabulary from `artifacts.py:30-61`	(in plan)
R5-F3	blocker before v0.47	Sweep unification plan assumes `DelimitVariant`/`DatamarkVariant` classes exist; `preprocessing.py` exports only functions	Plan revised: create 3 preprocessing dataclasses at v0.47 as part of sweep work; sweep contract clarified (neutral default; explicit `attack_threshold` required for ASR)	(in plan)
R5-F4	blocker before v0.46	Plan’s scalar-deprecation directive would replace the lazy export resolver (`__init__.py:302-312`), shattering all root imports	Plan corrected: extend existing `__getattr__` with a transitional deprecation branch (BEGIN/END markers); v0.47 removes only that branch, base resolver intact	(in plan)
R5-F5	blocker before v1.0	DeLong (`DeLongResult`, `delong_roc_variance`) is publicly exported + in API docs, but methodology + roadmap docs say “out of scope”	Decision U: keep public; update `methodology/comparison.md`, `methodology/reading_list.md`, `roadmap.md` to align with shipped state. Bundled into v0.48 §5E-prep	(in plan)
R5-F6a	packet drift	`cv_clt_ci` docstring claims “Bayle et al. 2020 prove a CV-CLT with a correction factor”; code does naive sample variance (no scalar correction)	Docstring fix bundled into v0.48 §5E-prep. Code is correct per Bayle (2020) Thm 3.1; docstring oversells	(in plan)
R5-F6b	packet drift	`methodology/parallelism.md:143-181` says “as of v0.34, harness not yet parallelized” + “once #29/#30 land” — but v0.36 wired `evaluate(n_jobs=)` + `evaluate_folded(n_jobs=)`	Doc fix bundled into v0.48 §5E-prep. Also clarify `bootstrap_ci`’s `n_jobs` is studentized-only	(in plan)
R5-F6c	packet drift	`methodology/testing.md:108-136` says reference-equivalence gap “closing in PR 1.5”; roadmap state shows it’s closed	Doc fix bundled into v0.48 §5E-prep	(in plan)
R5-F6d	packet drift	`methodology/calibration.md:15-18` lists only temperature/isotonic/Platt; Beta + 4-binary-adapter family also shipped	Doc fix bundled into v0.48 §5E-prep	(in plan)
R5-F6e	packet drift + code	`methodology/bootstrap.md` two-level example uses same array for val + test, causing ~63.2% overlap when `paired_bootstrap_op_point_diff` resamples val/test independently	Doc fix + defensive code guard bundled into v0.48 §5E-prep: rewrite example with disjoint split + raise `ValueError` in `paired_bootstrap_op_point_diff` if `val_y is test_y`	(in plan)
R5-F6f	partial verified	`CostSensitiveSelector` formula `t* = c_FP·(1-π) / (c_FP·(1-π) + c_FN·π)` is the prior-corrected form; existing docstring already cites Elkan’s prior-independent variant — intentional design, but easy to misuse on already-calibrated probabilities	Docstring sharpening (add `Warning` admonition) bundled into v0.48 §5E-prep. Math itself is correct per the documented intent	(in plan)
R5-F7	governance risk	Gate 3 redefinition (multi-model cross-review) is useful but not the same evidence class as external academic peer review; the plan understated this	Decision O revised: ADR 0003 explicitly states Gate 3 at v1.0 is internal model-assisted cross-review, NOT external academic peer review; documents what it catches and doesn’t	(in plan)
R5-Px	plan internal inconsistency	v0.46 verification smoke imports `TextTransform`, but Decision K introduces it at v0.47	Plan smoke test corrected — `TextTransform` import removed from v0.46 smoke	(in plan)
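
Illustrative sketch for R5-F4 (not the shipped `__init__.py`): a module-level `__getattr__` can carry a clearly delimited deprecation branch in front of the base lazy resolver, so a later release can delete only that branch. The table contents and the `demo_pkg` module are hypothetical stand-ins; only the names `_EXPORTS` and `_DEPRECATED_SCALARS` come from this ledger.

```python
import types
import warnings

# Hypothetical stand-ins for the real lazy-export tables.
_EXPORTS = {"scorecard": lambda: "scorecard-object"}
_DEPRECATED_SCALARS = {"pr_auc": lambda: "pr_auc-function"}


def module_getattr(name: str):
    # BEGIN transitional deprecation branch (delete this block only)
    if name in _DEPRECATED_SCALARS:
        warnings.warn(
            f"{name} is deprecated at the package root; use scorecard() instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return _DEPRECATED_SCALARS[name]()
    # END transitional deprecation branch

    # Base lazy resolver: must keep working after the branch above is removed.
    if name in _EXPORTS:
        return _EXPORTS[name]()
    raise AttributeError(f"module 'demo_pkg' has no attribute {name!r}")


demo_pkg = types.ModuleType("demo_pkg")
demo_pkg.__getattr__ = module_getattr  # PEP 562 hook lives in the module namespace

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy = demo_pkg.pr_auc      # routed through the deprecation branch
    current = demo_pkg.scorecard  # routed through the base resolver
```

Only the deprecated lookup should warn; the non-deprecated lookup resolves silently, which is the invariant the finding asks the plan to preserve.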

Round 5 totals: 13 findings (5 blockers, 1 high-pre-v1, 5 packet drift, 1 governance, 1 plan inconsistency). All resolved or scheduled before any v0.46 code shipped.
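
The ~63.2% figure in R5-F6e is the standard bootstrap coverage result: one resample of size n drawn with replacement from n rows is expected to contain a fraction 1 - (1 - 1/n)^n of the distinct rows, which tends to 1 - 1/e. A small check of that arithmetic follows; both helper names are hypothetical and nothing here calls eval_toolkit.

```python
import math
import random


def expected_unique_fraction(n: int) -> float:
    """Expected fraction of distinct rows in one bootstrap resample of size n."""
    return 1.0 - (1.0 - 1.0 / n) ** n


def simulated_unique_fraction(n: int, seed: int = 0) -> float:
    """Draw n row indices with replacement and measure how many rows were hit."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n) for _ in range(n)}
    return len(drawn) / n


limit = 1.0 - math.exp(-1.0)  # about 0.632
```

When validation and test are the same array, each independent resample covers roughly this share of the same rows, which is why the example needs a disjoint split.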

Key follow-on decisions (driven by Round 5):

Decision W: packet-drift fixes (F6a–F6f) bundle into v0.48 polish (not earlier hotfix).
Decision X: scorecard skipped-status detection reuses is_metric_defined_for_slice from v0.39.0.
Decision Y: Round 6 / 7 / 8 audits run as formal stop-gates between v0.46 / v0.47 / v0.48 / v1.0; 7-day timeout per gate.
Decision Z: this ledger file.
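
For R5-F6f, the two threshold forms can be compared directly. This is a minimal sketch of the formula quoted in the table next to Elkan's prior-independent threshold c_FP / (c_FP + c_FN); the function names are hypothetical and this is not the `CostSensitiveSelector` implementation.

```python
def prior_corrected_threshold(c_fp: float, c_fn: float, prevalence: float) -> float:
    """t* = c_FP*(1 - pi) / (c_FP*(1 - pi) + c_FN*pi), as quoted in R5-F6f."""
    if c_fp <= 0 or c_fn <= 0:
        raise ValueError("costs must be positive")
    if not 0.0 < prevalence < 1.0:
        raise ValueError("prevalence must lie strictly between 0 and 1")
    numerator = c_fp * (1.0 - prevalence)
    return numerator / (numerator + c_fn * prevalence)


def prior_independent_threshold(c_fp: float, c_fn: float) -> float:
    """Elkan-style threshold for probabilities that are already calibrated."""
    if c_fp <= 0 or c_fn <= 0:
        raise ValueError("costs must be positive")
    return c_fp / (c_fp + c_fn)
```

The two agree at a prevalence of 0.5 and diverge as the classes become imbalanced, which is the misuse risk the planned `Warning` admonition is meant to flag.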

Round 6 (complete: 2026-05-21 — Codex + Gemini + manual; 11 findings)

Ship date: v0.46.0 tagged + published to PyPI 2026-05-21. STOP-GATE per Decision Y.2 — release/v0.47.0 opens once R6-F1 (BLOCKER) + R6-F2 (HIGH) are fixed via v0.46.1 hotfix and the remaining items are scheduled.

Reviewers: author (manual) + Codex (independent) + Gemini (independent).

Packet to send to Codex + Gemini:

This plan file (~/.claude/plans/evaluate-all-the-work-twinkly-kite.md)
This audit ledger (docs/source/audit_findings.md)
docs/source/methodology/ (16 chapters — unchanged from Round 5)
docs/source/roadmap.md
docs/source/migration/v0.46.md (NEW — consumer migration guide)
docs/source/adr/0002-scorecard-as-primary-metric-surface.md (NEW)
CHANGELOG.md (new v0.45.0 + v0.46.0 entries since Round 5)
Source tree, with focus on the v0.45 + v0.46 additions:
- src/eval_toolkit/stacking.py (v0.45, ~370 lines)
- src/eval_toolkit/scorecards.py (renamed from _scorecard.py at v0.49.0; ~585 lines)
- src/eval_toolkit/metric_specs.py (v0.46, ~180 lines)
- src/eval_toolkit/__init__.py — __getattr__ deprecation branch + _DEPRECATED_SCALARS (Audit F4 invariant region)
- src/eval_toolkit/metrics.py:120-200 — SINGLE_CLASS_INCOMPATIBLE_METRICS extension (Round-5 X.2 precondition)

Audit prompt: ~/.claude/plans/gate3-audit-prompt.md (local). The “Known issues already in the v0.48 backlog (skip re-reporting)” section already lists drift items scheduled for v0.48 polish — Round 6 reviewers should skip those and surface only NEW findings against the v0.46 state.

Focus areas for Round 6 review:

scorecard surface design lock-in. The Tier-2 MetricSpec Protocol freezes at v1.0 — method-signature changes require a v2.0 major bump. Last cheap chance to catch contract gaps.
MetricResult cell-state contract — does the ok / skipped / error vocabulary cover every relevant failure mode? Are the reason strings useful for triage?
Per-cell error isolation — confirm that catching all exceptions in _evaluate_spec doesn’t hide important failures the user should see.
__getattr__ deprecation shim — Audit F4 invariant: does the branch correctly route deprecated names, NOT break non-deprecated resolution, and cleanly delete at v0.47?
Spec name encoding for parameterized metrics — is "ece_n_bins_15_strategy_uniform" a stable v1.0 commitment, or does the alphabetize-kwargs rule create surprise keys for custom user specs with multi-kwarg signatures?
Scorecard.to_pandas() MultiIndex schema — first-time-public; any shape lock-in concerns?
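
The per-cell error isolation question can be made concrete with a small sketch. `CellResult` and `evaluate_cell` are hypothetical stand-ins for `MetricResult` and `_evaluate_spec`; the point is only the ordering of the `except` clauses.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class CellResult:  # hypothetical stand-in for MetricResult
    status: str  # "ok" or "error"
    value: Optional[float] = None
    reason: Optional[str] = None


def evaluate_cell(compute: Callable[[], float]) -> CellResult:
    try:
        return CellResult(status="ok", value=float(compute()))
    except (MemoryError, RecursionError):
        # Resource exhaustion is not a metric problem; let it propagate.
        raise
    except Exception as exc:
        # Ordinary metric failures become a cell state with a triage reason.
        return CellResult(status="error", reason=f"{type(exc).__name__}: {exc}")
```

In standard Python, `KeyboardInterrupt` and `SystemExit` derive from `BaseException`, so a bare `except Exception` does not capture them; naming them in the re-raise clause is harmless but not strictly required.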

Triage on findings: each blocker → p1-gate3-labelled GitHub issue + a row in this ledger. Either fix-as-v0.46.1-hotfix or fold into v0.47 design (per Decision Q severity-tiered hotfix policy).

ID	Reviewer	Severity	Finding	Disposition	Lands
R6-F1	Codex	BLOCKER before v0.47 opens	`metric_specs.ece(strategy="typo")` silently dispatches to quantile ECE and returns scorecard cell with `status="ok"` under invalid key (`"ece_n_bins_15_strategy_typo"`). Wrong-by-design data correctness path. Verified via Codex runtime probe.	Add strategy validation in `ece()` factory + `_EceSpec.compute()`; raise `ValueError("ECE strategy must be 'uniform' or 'quantile'; got {strategy!r}")` (plan §2.5A).	RESOLVED v0.46.1 (commit `7a4bb14`, tag `v0.46.1` 2026-05-21; consumer pin bumped same day)
R6-F2	Codex + Gemini	HIGH before v0.47 scalar hard-removal	ECE deprecation warnings in `__init__.py:_scorecard_spec_for()` emit broken migration snippets for all 5 ECE variants. Two-part bug: (a) for the 2 variants in `metric_specs`, the suggested scorecard key uses the factory-call expression (`"ece(n_bins=10)"`) instead of the encoded spec name (`"ece_n_bins_10_strategy_uniform"`); (b) for the 3 variants NOT in `metric_specs` (`_debiased`, `_l2`, `_l2_debiased`), the fallback name isn’t an importable spec. Gemini claimed pre-v0.46 default was `n_bins=15` (verified incorrect — code at `metrics.py:730-734` shows `n_bins=10`); Decision R6-F resolves: warning uses `n_bins=10` to preserve pre-v0.46 math + adds migration note about new factory default.	Restructure `_scorecard_spec_for()` to return `(factory_expr, scorecard_key, has_first_party)` tuple; correct snippets for first-party variants with `n_bins=10`; submodule-path template for 3 non-first-party variants per Decision R6-G (plan §2.5B).	RESOLVED v0.46.1 (commit `7a4bb14`, tag `v0.46.1` 2026-05-21; consumer-side smoke verified all 5 ECE-variant warnings + submodule-path routing)
R6-F3	Codex	HIGH before scorecard freeze	Duplicate `MetricSpec.name` values in the same `scorecard()` call silently overwrite earlier cells (last-wins). Not a documented contract.	Decision R6-B (locked): reject in `scorecard()` with `ValueError("Duplicate MetricSpec name 'X' at index N; ...")`. Forces caller to disambiguate; no silent data loss. (Plan §4G.)	v0.47
R6-F4 (= Gemini R6-F1)	Codex + Gemini	HIGH before v1.0	`scorecard(seed=None)` documented as non-deterministic; implementation coerces `None → 0`. Doc/impl contradiction. Verified by Codex via bit-for-bit equality test.	Decision R6-A (locked): deterministic-by-default; fix docs only. No behavior change. Plan §4G-prep. (Decision R6-E: rolls to v0.47 — R6-A is non-blocker per Decision Q’s “docstring” category.)	v0.47
R6-F5	Codex	Contract-enforcement gap before v1.0	ADR 0003 promises strict Tier-2 Protocol method-shape stability; current public-API drift guard only snapshots `(*args, **kwargs)` for Protocol classes, not method signatures. The guard does not see changes to `MetricSpec.compute`, `MetaLearner.fit`, etc.	Decision R6-D (locked): extend `tests/test_public_api.py` snapshot to capture Protocol method signatures via `inspect.signature` + `typing.get_type_hints` for the 9 Tier-2 Protocols. (Plan §4I.)	v0.47
R6-F6	Codex	Packet drift	v1.0 plan + roadmap still describe pre-v0.46 scorecard shapes that didn’t ship: `ece_n_bins_15` without strategy in plan, `ece_quantile()` factory listed (shipped as `ece(strategy='quantile')`), `MetricUndefinedError` mentioned (ADR 0002 chose no new public exception), `n_resamples >= 100` floor (shipped is `>= 1`). Roadmap “Currently shipped” still says v0.44.	Plan §4L: refresh plan §3A scorecard examples + roadmap shipped-state section. Doc-only commit on v0.47 release branch.	v0.47
R6-F3 (Gemini)	Gemini	MEDIUM (schema lock-in before v1.0)	`Scorecard.to_pandas()` MultiIndex columns expose `value, status, reason, ci_low, ci_high, confidence` but drop `n_resamples` + `method` from `BootstrapCI`. Provenance loss compared to `to_dict()`. v1.0 is about to lock the schema.	Decision R6-C (locked): add `n_resamples` + `method` columns at v0.47 (additive). Schema becomes lossless against `to_dict()`. (Plan §4H.)	v0.47
R6-F4 (Gemini)	Gemini	LOW	`MetricSpec` Protocol doesn’t enforce stable parameterized-spec naming. Custom users implementing multi-kwarg parameterized specs can silently spawn distinct dict keys if constructor arg order varies.	Decision R6-H (locked): add `make_spec_name(prefix, **kwargs)` canonicalization helper in `metric_specs.__all__` only (NOT top-level `_EXPORTS` — Tier-2 additive contract). Alphabetized kwargs, snake_cased, joined by underscore. (Plan §4J.)	v0.47
R6-F5 (Gemini)	Gemini	LOW	`_evaluate_spec()` wraps `spec.compute()` in broad `except Exception`. Swallows `MemoryError`, `RecursionError`, `KeyboardInterrupt`, `SystemExit` into cell state — process exhaustion / user-interrupt signals get hidden as metric errors.	Narrow exception catch: `except (MemoryError, RecursionError, KeyboardInterrupt, SystemExit): raise` first, then existing broad catch. (Plan §4K.)	v0.47
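
A minimal sketch of the R6-F1 disposition, assuming validation happens before the spec name is built. The error message is the one quoted in the table; `validate_ece_strategy` and `ece_spec_name` are hypothetical helpers, not the `metric_specs` API.

```python
_VALID_ECE_STRATEGIES = ("uniform", "quantile")


def validate_ece_strategy(strategy: str) -> str:
    if strategy not in _VALID_ECE_STRATEGIES:
        raise ValueError(
            f"ECE strategy must be 'uniform' or 'quantile'; got {strategy!r}"
        )
    return strategy


def ece_spec_name(n_bins: int = 15, strategy: str = "uniform") -> str:
    """Hypothetical factory fragment: validate first, then build the key."""
    validate_ece_strategy(strategy)
    return f"ece_n_bins_{n_bins}_strategy_{strategy}"
```

With the check in place, a typo can no longer produce an `ok` cell under a key such as `"ece_n_bins_15_strategy_typo"`.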

Round 6 totals: 11 findings (Codex 6 + Gemini 5; 2 overlap on seed=None + ECE deprecation snippets but with different reasoning angles). 1 BLOCKER (R6-F1) + 5 HIGH + 2 MEDIUM/contract + 3 LOW. All dispositioned to either v0.46.1 (2 fixes) or v0.47 (9 fixes).
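
The duplicate-name rejection in R6-F3 could look roughly like the sketch below. The ledger elides the tail of the real message, so the "first seen" suffix is an assumption added here for illustration, and `reject_duplicate_names` is a hypothetical helper.

```python
from typing import Dict, Iterable


def reject_duplicate_names(names: Iterable[str]) -> None:
    """Raise on the first repeated name instead of letting last-wins overwrite."""
    seen: Dict[str, int] = {}
    for index, name in enumerate(names):
        if name in seen:
            raise ValueError(
                f"Duplicate MetricSpec name {name!r} at index {index} "
                f"(first seen at index {seen[name]})"  # suffix is an assumption
            )
        seen[name] = index
```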

Key follow-on decisions (driven by Round 6 — locked in plan):

Decision R6-A: seed=None deterministic-by-default; fix docs only.
Decision R6-B: Reject duplicate MetricSpec.name with ValueError.
Decision R6-C: Add n_resamples + method to to_pandas() schema.
Decision R6-D: Extend public-API snapshot to cover Protocol method signatures.
Decision R6-E: v0.46.1 scope = R6-F1 + R6-F2 only; R6-A rolls to v0.47 (non-blocker per Decision Q’s “docstring” category).
Decision R6-F: Use n_bins=10 (pre-v0.46 default) in deprecation warnings + migration note about new v0.46+ factory default of n_bins=15. Corrects Gemini’s misverified pre-v0.46 default claim.
Decision R6-G: 3 ECE variants without metric_specs (debiased, l2, l2_debiased) route deprecation warnings to submodule path; do NOT add to metric_specs at v0.47.
Decision R6-H: make_spec_name() helper in metric_specs submodule only; not top-level.
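
Decision R6-H describes the canonicalization rule as alphabetized kwargs joined by underscores. A sketch under that reading follows; it is not the shipped helper, and it leaves value formatting (including any snake-casing of values) unspecified beyond `str()`.

```python
def make_spec_name(prefix: str, **kwargs: object) -> str:
    """Build a spec key that does not depend on keyword-argument order."""
    parts = [prefix]
    for key in sorted(kwargs):
        parts.append(f"{key}_{kwargs[key]}")
    return "_".join(parts)


# Same parameters, different call order, same key.
key_a = make_spec_name("ece", n_bins=15, strategy="uniform")
key_b = make_spec_name("ece", strategy="uniform", n_bins=15)
```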

Round 6 v0.46.1 ship status (2026-05-21)

R6-F1 ✅ SHIPPED in v0.46.1 (PR #67, squash 7a4bb14). End-to-end verified in consumer: ms.ece(strategy="typo") raises ValueError; direct _EceSpec(strategy=...) construction also raises (defence-in-depth).
R6-F2 ✅ SHIPPED in v0.46.1 (PR #67, squash 7a4bb14). End-to-end verified in consumer: eval_toolkit.expected_calibration_error warning carries ece(n_bins=10) + key ece_n_bins_10_strategy_uniform + migration note about v0.46+ n_bins=15 default. All 3 non-first-party variants (debiased, l2, l2_debiased) route to from eval_toolkit.metrics import … submodule path.
9 other Round 6 items (R6-A docstring, R6-B duplicate-name guard, R6-C to_pandas schema, R6-D Protocol method-shape snapshot, R6-F4-Gemini make_spec_name, R6-F5-Gemini narrow except, R6-F6 plan/roadmap state-drift) → folded into release/v0.47.0 per Decision R6-E.
Round 6 STOP-GATE status: CLOSED. release/v0.47.0 can open after the v0.46.1 consumer cycle observation completes (1 cycle).
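
For the R6-D Protocol method-shape snapshot, the core idea can be sketched with `inspect.signature` alone. `MetricSpecLike` is a hypothetical Protocol, and this sketch omits the `typing.get_type_hints` half of the planned guard.

```python
import inspect
from typing import Dict, Protocol


class MetricSpecLike(Protocol):  # hypothetical Protocol for illustration
    name: str

    def compute(self, y, s) -> float: ...


def snapshot_protocol_methods(proto: type) -> Dict[str, str]:
    """Record the signature string of every public method defined on proto."""
    snapshot = {}
    for attr, member in vars(proto).items():
        if attr.startswith("_") or not inspect.isfunction(member):
            continue
        snapshot[attr] = str(inspect.signature(member))
    return snapshot


baseline = snapshot_protocol_methods(MetricSpecLike)
```

Comparing such a dictionary against a committed snapshot would flag a changed parameter list on `compute`, which a class-level `(*args, **kwargs)` snapshot cannot see.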

Round 7 (complete: 2026-05-21 — Codex + Gemini; 3 substantive findings)

Reviewers: author (manual) + Codex (independent report) + Gemini (independent report).

Packet: v0.47.0 code state + the v1.0 plan + docs/source/methodology/ (16 chapters) + ADRs 0001/0002/0003 + docs/source/migration/v0.46.md + docs/source/migration/v0.47.md + Round 5/6 ledger.

Round-7 briefing: gate3-audit-round-7.md (committed a9e1114).

Reports: gate3-audit-round-7-codex-report.md + gate3-audit-round-7-gemini-report.md (untracked per .gitignore).

Headline: Codex 3 substantive findings; Gemini 0. Overlap was zero between the two reports — the most consequential finding (R7-F1 doc-migration boundary gap between Sybil-tested fences and MyST-NB-executed example notebooks) was Codex-only. Reinforces the Round 6 pattern (do not use overlap as a confidence floor; single-reviewer findings can be the most critical).

ID	Reviewer	Severity (their words)	Finding	Disposition	Lands
R7-F1	Codex	high before v0.48	v0.47 doc migration missed MyST-NB executable example notebooks (separate from Sybil-collected `.md` fences). 6 example pages + 4 module-level docstrings + `protocols.md` autosummary + roadmap wording still reference removed APIs. Docs CI runs `sphinx-build` without `-W`, so notebook execution failures pass as advisory warnings. Verified via `sphinx-build` runtime probe — 6 execution failures buried in the warning stream.	Decision R7-A (locked at /exploring-options Q3): bundle into v0.48 §5G/§5H. §5G migrates the 6 notebooks + 4 docstrings + autosummary + roadmap; §5H enables `nb_execution_raise_on_error = True` in `conf.py`. Audit-as-seed expansion (Q2 locked full sweep) covered ALL module docstrings + drift in 5 existing `api/*.md` autosummary lists + 8 missing `api/*.md` pages.	RESOLVED v0.48.0 (§5G commit `e07db16` + §5H commit `6349472` on `release/v0.48.0`)
R7-F2	Codex	high before sweep freezes	`sweep()` records only `strategy.name` per row; two configured instances of same dataclass (e.g., `DelimitVariant(delimiter="<<")` + `DelimitVariant(delimiter="[[")`) silently merge under `groupby("variant")`. Style-coherent defect class with Round 6 R6-F3 (scorecard duplicate name) but with different semantics (row container vs. Mapping).	Decision R7-B option C (locked): emit `strategy_id` canonical column AND reject duplicate `strategy_id` at sweep boundary. Style invariants 1 (no silent failures) + 2 (natural call pattern is right) + 4 (canonical identifier + reject in canonical dimension) read together.	RESOLVED v0.48.0 (§5I commit `f454afe`)
R7-F3	Codex	worth fixing before v1.0	`sweep()` doesn’t validate scorer output cardinality. Three failure modes via runtime probe: overlong 1-D → silent truncation (worst); short 1-D → IndexError later; (n,2) matrix → TypeError when `float()` applied.	Decision R7-C (locked): API-level `ValueError` with contextual label at the sweep boundary; replaces all three low-level failure modes. Style invariants 1 + 3.	RESOLVED v0.48.0 (§5J commit `fcf99f0`)
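
One way to picture the R7-B outcome: derive a canonical `strategy_id` from a strategy's configuration and reject repeats at the sweep boundary. The dataclass, the id format, and both helpers below are hypothetical; the shipped column format is not described in this ledger.

```python
from dataclasses import asdict, dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class DelimitVariantSketch:  # hypothetical stand-in for DelimitVariant
    delimiter: str
    name: str = "delimit"


def strategy_id(strategy) -> str:
    """Name plus sorted parameters, so configured instances stay distinct."""
    params = asdict(strategy)
    name = params.pop("name")
    suffix = ",".join(f"{key}={params[key]!r}" for key in sorted(params))
    return f"{name}({suffix})"


def check_unique_strategy_ids(strategies: Sequence) -> List[str]:
    ids = [strategy_id(s) for s in strategies]
    duplicates = sorted({i for i in ids if ids.count(i) > 1})
    if duplicates:
        raise ValueError(f"Duplicate strategy_id values: {duplicates}")
    return ids
```

Grouping by this id instead of `strategy.name` keeps `DelimitVariant(delimiter="<<")` and `DelimitVariant(delimiter="[[")` in separate groups.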

Gemini observations (Round 7)

Gemini’s report verdict was “highly stable; release/v0.48.0 is safe to open.” Six minor observations / validations; nothing critical that Codex hadn’t covered. The actionable items folded into v0.48:

§1-3 + 5-7: VALIDATIONS of v0.47 shipped state (TextTransform shape, shim removal, sweep design, R6-D Protocol method-shape snapshot, ADR 0003 tiers). No action needed.
§4 (pedagogical drift): Gemini noted “from eval_toolkit.metrics import pr_auc” is syntactically green but slightly undermines ADR 0002. v0.48 §5G migration explicitly chose scorecard() for example notebooks teaching METRIC USAGE; submodule path only where teaching the underlying math.
§4 (Makefile pre-push): Gemini recommended hardening to prevent the pytest tests/ path-override trap. Landed as v0.48 §5L (make pre-push target running all 3 doc-execution surfaces; commit 9878a54).
§5 (R6-C dtype coercion): Gemini noted n_resamples (int + NaN) → float64 is an accepted tradeoff. Landed as v0.48 §5K (Notes section on Scorecard.to_pandas() docstring; commit 6304cea).
§6 (SynonymSubstitution whitelist): Gemini recommended adding a docstring note about the hardcoded 6-entry whitelist. Landed as v0.48 §5K (adversarial.py Notes section; commit 6304cea).

Audit-as-seed extensions (v0.48)

Per user direction during plan refinement (“use the audits as seeds for things to reconsider”), the Round 7 findings + style-invariants framing surfaced additional v0.48 scope beyond Codex’s explicit list:

§5G expansion: from 4 Codex-flagged module docstrings to full sweep across src/eval_toolkit/ module docstrings + audit of all docs/source/api/*.md autosummary pages. Found 8 missing API pages + 5 drifted autosummary lists.
§5M new: in-source docstring drift audit (third doc-execution surface). Result: 82 PASS / 1 skipped / 0 fail; expanded .doctest-modules from 11 → 21 modules so CI catches future drift.
§5N comprehensive: cross-API shape-validation consistency sweep beyond Codex’s R7-F3 target. Audited metrics_at_threshold, paired_bootstrap_op_point_diff, bootstrap_metric_from_predictions, metrics.py scalars, fit_*_binary / fit_*_calibrator. Tightening commit landed for metrics_at_threshold silent threshold semantics (commit 76773dc); paired_bootstrap_op_point_diff val_y is test_y guard landed as part of §5E-prep code-side fix (commit 5c8e68d).
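
The shape-validation theme behind R7-F3 and §5N can be sketched as a single boundary check that replaces the three low-level failure modes (silent truncation, late `IndexError`, `TypeError` from `float()`). The function name, signature, and messages are assumptions for illustration only.

```python
from typing import List, Sequence


def validate_scorer_output(scores: Sequence, n_rows: int, label: str) -> List[float]:
    """Require exactly one scalar score per input row, or fail with context."""
    if len(scores) != n_rows:
        raise ValueError(
            f"{label}: scorer returned {len(scores)} scores for {n_rows} rows"
        )
    validated = []
    for index, value in enumerate(scores):
        if isinstance(value, (list, tuple)):
            raise ValueError(
                f"{label}: score at row {index} is a nested sequence; "
                "expected a 1-D output"
            )
        validated.append(float(value))
    return validated
```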

Round 7 ship status

3 substantive Codex findings: all RESOLVED in v0.48.0 via §5G + §5H + §5I + §5J.
6 Gemini observations: all RESOLVED in v0.48.0 via §5G + §5K + §5L.
Audit-as-seed extensions (§5G expanded, §5M new, §5N comprehensive): all RESOLVED in v0.48.0.
Round 7 STOP-GATE status: CLOSED via v0.48.0 release. Round 8 audit STOP-GATE per Decision Y.2 opens against the v0.48.0 state before v1.0.0 tag can land.

Round 8 (2026-05-24) — STOP-GATE before v1.0 tag, rectified in v0.51.0

Multi-LLM cross-review fired against the v0.50.0 state. Codex returned 10 findings (3 with reproducible probes: C3, C4(a)/(b), C9); Gemini returned 5 findings + 3 positive validations (V1 style, V2 methodology, V3 scorer_error safety). Independent verification by Claude (audit-verification-codex-gemini-v0.50.0.md, 510 lines, untracked per .gitignore) confirmed 13 of 18 items, refuted 3 (R8-G2, R8-G5, and the over-confident Gemini validations R8-V1 + R8-V2, counted as one), and deferred 2 (R8-G3, R8-G4) to v1.x as Tier-2 additive. The v0.51.0 release ships fixes for all 13 confirmed items; the 3 refuted entries are recorded below for the audit trail.

Round 9 multi-LLM cross-review runs against the v0.51 RC before v1.0 tags.

Confirmed (13) — fixed in v0.51.0

ID	Severity	Finding	Disposition	Commit
R8-C3	blocker (escalated from Codex fix-rec)	`recall_at_fpr` fallback violated FPR ceiling (`actual_fpr=1.0` when `target_fpr=0.0`); probe `y=[0,1], scores=[1.0,1.0]`.	v0.51 sentinel `threshold=np.inf, actual_fpr=0.0, fp=0`.	`672d45f`
R8-C4(a)	blocker (escalated)	Generator-rng not parallel-stable in `_score_all_slices`; n_jobs=1 vs n_jobs=2 produced different CIs; SPEC 7 contract violation.	v0.51 spawns child SeedSequences at the dispatch boundary.	`61964f6`
R8-C4(b)	blocker (escalated)	`spawn_seed_sequences` ignored Generator state (extracted `bit_generator.seed_seq`).	v0.51 draws fresh entropy via `rng.integers(...)`.	`87453f6`
R8-C1	fix-recommended	`evaluate_folded(seeds=)` reused the splitter across the seed loop — partitions never varied.	v0.51 adds `reseed_splitter` callback + DeprecationWarning when multi-seed + None. Warning persists past v1.0.	`f60d43c`
R8-C2	fix-recommended	`SourceDisjointKFoldSplitter.iter_folds` yielded empty test partitions when k > n_sources; `get_n_splits` returned `min(k, n_sources)`.	v0.51 caps iter_folds at `min(k, n_sources)` + UserWarning.	`f60d43c`
R8-C5	fix-recommended	README links to `docs/...` (broken); migration toctree listed only v0.7–v0.9 despite v0.49–v0.51 BREAKING releases.	v0.51 repoints all README links to `docs/source/...`; toctree + MIGRATION.md index extended through v0.51; v0.49/v0.50/v0.51 migration guides authored.	`c206b54`
R8-C6	fix-recommended	`calibration.reliability_curve`, `maximum_calibration_error` did not validate `y_score ∈ [0,1]`; `fit_temperature` did not validate `bounds`.	v0.51 Tier-2 additive validation matching `metrics.py`-side ECE rigor.	`4c5e140`
R8-C7	fix-recommended	Tests covered counts where semantics matter: `test_harness_folded.py:92-107` no partition-content check; `test_harness_parallelism.py:77-122` no Generator-rng; `test_protocol_conformance.py:228-240` no k>n_sources.	v0.51 adds semantic-property regression tests bundled with C1/C2/C4 fixes.	`87453f6` + `61964f6` + `f60d43c`
R8-C8	minor	`SimilarityStrategy` listed as Tier-2 in README + `extending.md` but absent from `strict_tier2_protocols.md`’s 9-strict list.	v0.51 demotes SimilarityStrategy to “pre-v0.7 internal interface” in README + extending.md (aligns to v1.0 contract per ADR 0003).	`c206b54`
R8-C9	minor	`GateResult.to_dict()` returned numpy + NaN unchanged; `json.dumps(..., allow_nan=False)` raised TypeError on the result.	v0.51 docstring documents the JSON-safety contract — strict-JSON requires `artifacts.write_json_strict` or `sanitize_for_json`.	`c206b54`
R8-C10	minor	`.gitignore` patterns covered only `gate3-audit-*-report.md`; new `codex-comprehensive-audit-*-report.md` + `audit-verification-*.md` were tracked by default.	v0.51 `.gitignore` extended with new patterns.	`c206b54`
R8-F1	fix-recommended (Claude — missed-by-both)	`losses.RecallAtLowFPR.__init__` did not validate `pos_weight > 0`; non-positive values produced degenerate-but-bounded loss values silently.	v0.51 eager validation matching sibling-kwarg pattern.	`4c5e140`
R8-F2	minor (Claude — missed-by-both)	`metric_specs.ece(n_bins=)` validated `strategy` eagerly but deferred `n_bins` to compute time.	v0.51 eager `_validate_n_bins` call at factory level.	`4c5e140`
R8-F3	minor (Claude — missed-by-both)	`analysis.CsvPredictionReader.read_predictions` silently filled missing CSV columns with empty strings → cryptic dtype error downstream.	v0.51 detects missing columns at read time; raises actionable `ValueError`.	`4c5e140`
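
The R8-C4 fixes rest on two NumPy facilities: drawing fresh entropy from a `Generator` and spawning child `SeedSequence` objects, one per unit of work. The sketch below assumes one child per slice; the helper names and the bounds passed to `rng.integers` are assumptions, since the ledger elides the actual arguments.

```python
import numpy as np


def spawn_child_seeds(rng: np.random.Generator, n_children: int):
    # Draw entropy from the Generator's current state, so an already-advanced
    # Generator yields different children than a fresh one.
    entropy = int(rng.integers(0, 2**63 - 1))
    return np.random.SeedSequence(entropy).spawn(n_children)


def slice_draws(children):
    # Each slice owns its child stream, so the value it produces does not
    # depend on which worker runs it or in what order.
    return [float(np.random.default_rng(child).random()) for child in children]


serial = slice_draws(spawn_child_seeds(np.random.default_rng(7), 4))
shuffled_children = list(reversed(spawn_child_seeds(np.random.default_rng(7), 4)))
reordered = slice_draws(shuffled_children)[::-1]
```

Because the children are fixed at the dispatch boundary, processing them in a different order gives the same per-slice values, which is the property the `n_jobs=1` versus `n_jobs=2` probe checks.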

Refuted (3) — recorded for audit trail; NO fix shipped

ID	Severity (as auditor marked it)	Finding (rejected)	Rationale
R8-G2	fix-recommended (Gemini)	“Cyclic import leakage→harness identified but left unresolved.”	REFUTED: `harness.py:709-712` resolves the cycle via TYPE_CHECKING + lazy `noqa: PLC0415` import; comment explicitly states the resolution. Cycle is deliberately broken by an asymmetric-by-design pattern, not unresolved.
R8-G5	minor (Gemini)	“Plotting tests use weak assertions (`assert fig is not None`).”	REFUTED: `tests/test_plotting_edge.py:125-140` does have that weak assertion, but the same file has structural assertions (line 92 `assert out is fig`; lines 220, 230 `.get_title()` equality). Cherry-picked; not representative.
R8-V1	“Style + consistency: Exceptional” (Gemini)	Validation that the axis is exceptional.	REFUTED: Gemini’s specific tooling claim (Ruff + Black + Mypy strict compliance) is true, but the axis is broader. Codex C1 + C5 + C6 + C7 + C8 + R8-G3 confirm substantive cross-API / docs / contract consistency drifts. Tooling-clean ≠ axis-exceptional.
R8-V2	“Domain methodology: Masterclass” (Gemini)	Validation that methodology is masterclass.	REFUTED: Codex C1 + C3 + C4 + C6 confirm methodology bugs at fix-recommended-or-blocker severity. Methodology has real flaws; “Masterclass” overstates. Gemini’s evidence (research-grounded tests exist) conflates research-citation discipline with methodology correctness.
R8-V3	“scorer_error catches predictions safely” (Gemini)	Validation that the mechanism is safe.	CONFIRMED, recorded for completeness: `harness.py:468-474` correctly re-raises `MemoryError` + `AssertionError` BEFORE the broad `except Exception`, so critical exceptions escape the silent-capture path.

Deferred to v1.x (2)

ID	Severity	Finding	Disposition
R8-G3	fix-recommended (Gemini)	Codebase uses `ValueError` for every error condition (295 raises in src/; 0 custom exception classes).	Deferred to v1.x as Tier-2 additive. Custom exception hierarchy subclassing ValueError preserves catch-compatibility; adding it post-v1.0 is non-breaking. Reconsider when a downstream consumer requests it.
R8-G4	minor (Gemini)	joblib OOM hazard documented at `_parallel.py:55-59` but not structurally mitigated (no memory-aware `n_jobs` capping in `harness.evaluate`).	Deferred to v1.x. Memory-aware capping needs RAM-measurement + dataframe-size accounting — non-trivial, no clear best-practice. Caller is responsible for `n_jobs` sizing under their RAM budget at v1.0.
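
The catch-compatibility argument in R8-G3 can be shown in a few lines. The exception class names and `check_probabilities` are hypothetical; no such hierarchy exists in the codebase at v0.51.

```python
class EvalToolkitError(ValueError):
    """Hypothetical base class for a future custom hierarchy."""


class MetricInputError(EvalToolkitError):
    """Hypothetical subclass for invalid metric inputs."""


def check_probabilities(scores):
    for score in scores:
        if not 0.0 <= score <= 1.0:
            raise MetricInputError(f"y_score must lie in [0, 1]; got {score!r}")
    return list(scores)


# Existing caller code written against ValueError keeps working unchanged.
try:
    check_probabilities([0.2, 1.5])
    caught_as_value_error = False
except ValueError:
    caught_as_value_error = True
```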

Round 8 ship status

13 confirmed findings: all RESOLVED in v0.51.0 via the per-finding commits above (Phase 1 + 2 + 3 + 4 on release/v0.51.0).
3 refuted findings (R8-G2 cyclic-import framing, R8-G5 cherry-picked weak test, R8-V1+R8-V2 over-confident Gemini validations paired-as-one): recorded above; no shipped fix needed.
2 deferred findings (R8-G3 custom exceptions, R8-G4 joblib OOM capping): Tier-2 additive; v1.x or later.
Tally: 13 + 3 + 2 = 18 total (canonical; aligned with CHANGELOG.md v0.51 section + migration/v0.51.md per RC4 reconciliation).
Round 8 STOP-GATE status: CLOSED via v0.51.0 ship. Round 9 audit STOP-GATE per Decision Y.2 opens against the v0.51.0 RC before v1.0.0 tag can land.
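
For the JSON-safety contract noted under R8-C9, the sketch below shows why a sanitizing pass is needed before strict encoding. `to_strict_json_safe` is a hypothetical helper, not `sanitize_for_json`, and it handles only non-finite Python floats; NumPy scalars are out of scope here.

```python
import json
import math


def to_strict_json_safe(value):
    """Replace NaN and infinities with None, recursing through containers."""
    if isinstance(value, float) and not math.isfinite(value):
        return None
    if isinstance(value, dict):
        return {str(key): to_strict_json_safe(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_strict_json_safe(item) for item in value]
    return value


payload = {"passed": False, "observed": float("nan"), "bounds": (0.0, float("inf"))}
encoded = json.dumps(to_strict_json_safe(payload), allow_nan=False)
```

Passing `payload` straight to `json.dumps(..., allow_nan=False)` raises, because the standard encoder refuses out-of-range floats in strict mode.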

Round 9 (2026-05-24) — STOP-GATE before v1.0 tag

Round 9 multi-LLM cross-review fired against the release/v0.51.0 RC (7 commits at the time of audit; PR #75 draft, CI green). Codex returned 4 substantive findings (RC1-RC4) + self-validation worklog (69/69 v0.51 regression tests pass); explicitly skipped the 10 modules neither Round-8 auditor cited + 4 cross-cutting hunts. Gemini returned 6 items (RG1-RG6): 3 design challenges (2 escalating already-locked v0.51 decisions to v1.0 blockers) + 3 positive validations; zero probes, zero line citations, ~45% validation density. Independent verification by Claude (audit-verification-round-9-v0.51.0.md, 228 lines; gitignored per R8-C10) confirmed 6 / refuted 3 / partial 1 of the 10 source-report items, AND surfaced 3 third-audit findings + 3 minors by hunting the modules Codex skipped. Of those, F-sweep-1 was a CANDIDATE v1.0 BLOCKER (R7-C “no silent failures” invariant violation on NaN scorer output) that neither Round-8 nor Round-9 auditor caught.

Per the Path 3 lock from the post-R9 planning session, two third-audit findings (F-sweep-1 + F-bootstrap-1) ship fixed in this PR; RC1 was already fixed in this PR at 4c43771; the remaining items are deferred to v1.0.1.

Confirmed (6) — disposition mapped

ID	Severity	Finding	Disposition	Commit / Defer-to
R9-RC1	fix-recommended (Codex)	README.md:208 has 2 broken `docs/extending.md` + `docs/examples/claims_and_gates.md` hyperlinks — R8-C5 regression (only the API-surface table-cell links; other R8-C5 sites are correct).	Fixed in PR.	`4c43771`
R9-RC2	fix-recommended (Codex)	`SimilarityStrategy` contract mismatch: `__init__.py:294` exports it + `snapshot.json:1103-1117` pins `protocol_methods` (de facto strict Tier-2 contract) BUT `extending.md:18` + `strict_tier2_protocols.md` 9-strict list omit it. Snapshot IS the v1.0 contract regardless of docs.	Deferred to v1.0.1. Two paths: add to strict-Tier-2 list (10 strict + 1 opt-in; ADR 0003 update) OR remove from `__all__` + snapshot.	v1.0.1
R9-RC3	fix-recommended (Codex)	`test_evaluate_folded_reseed_splitter_varies_partitions` at `tests/test_harness_folded.py:134-164` only asserts key existence + count; no row-content comparison across seeds. Docstring claims partitions differ but assertions don’t check it. Directly REFUTES Gemini RG6.	Deferred to v1.0.1 test hardening.	v1.0.1
R9-RC4	minor (Codex)	“3 refuted” counts categories (V1+V2 paired as one) while 4 named items listed; defensible across all 4 surfaces but invites confusion.	Deferred to v1.0.1 docs polish.	v1.0.1
R9-RG4	validation (Gemini)	metrics.py `SINGLE_CLASS_INCOMPATIBLE_METRICS` pattern is “elegant”.	Validation accurate; pattern at metrics.py:35-99 is well-designed.	—
R9-RG5	validation (Gemini)	bootstrap.py defaults match “industry standards”.	Validation accurate: `bootstrap.py:66-69` shows N=1000, α=0.95, BCa, seed=42 — all match common conventions. Terminology nit: Gemini used outdated “seed=” instead of v0.50 “rng=”.	—
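
R9-RC3 distinguishes a count-only assertion from a row-content comparison. The sketch below shows what the stronger test shape might look like; `kfold_partitions` is a hypothetical seeded splitter, not `evaluate_folded` or any eval_toolkit splitter.

```python
import random
from typing import FrozenSet, List


def kfold_partitions(n_rows: int, k: int, seed: int) -> List[FrozenSet[int]]:
    """Hypothetical splitter: shuffle row indices, then deal them into k folds."""
    order = list(range(n_rows))
    random.Random(seed).shuffle(order)
    return [frozenset(order[i::k]) for i in range(k)]


def test_partitions_vary_across_seeds() -> None:
    first = kfold_partitions(50, 5, seed=0)
    second = kfold_partitions(50, 5, seed=1)
    # Weak proxy: passes even if both seeds produce identical folds.
    assert len(first) == len(second) == 5
    # Row-content check: the fold memberships must actually differ.
    assert set(first) != set(second)
    # Sanity: every row still lands in exactly one fold.
    assert sorted(i for fold in second for i in fold) == list(range(50))
```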

Refuted (3) — recorded for audit trail; NO fix shipped

ID	Severity (as auditor marked it)	Finding (rejected)	Rationale
R9-RG2	v1.0 blocker (Gemini)	“DeprecationWarning without sunset date is structurally wrong; v1.0 blocker.”	REFUTED: re-litigates Q3-locked v0.51 design. `harness.py:1413-1414` shows the deliberate implementation; lines 1347-1349 + 1368 document “warning persists past v1.0” as chosen contract. Tested by `test_evaluate_folded_multi_seed_without_reseed_emits_deprecation_warning`. Meta-pattern: Gemini escalates without engaging with locked reasoning (same as R8 V1+V2).
R9-RG3	v1.0 blocker (Gemini)	“iter_folds capping silently is wrong; should raise.”	REFUTED: re-litigates R8-C2 design lock. `splits.py:325-339` shows the deliberate cap-with-UserWarning implementation; warn-vs-raise IS a judgment call; the v0.51 decision was permissive-by-design (caller can opt into `warnings.filterwarnings('error', category=UserWarning)` for strictness).
R9-RG6	validation (Gemini)	“Regression tests apply strict invariant assertions, not weak proxies.”	REFUTED — directly contradicted by RC3 which I independently confirmed. `tests/test_harness_folded.py:134-164` is precisely the weak-proxy pattern Gemini claims doesn’t exist. Gemini didn’t read the cited tests; textbook over-confident validation.
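
The warn-vs-raise trade-off in R9-RG3 can be sketched with the standard `warnings` module. `capped_fold_count` is a hypothetical stand-in for the capping logic, not the toolkit's `iter_folds`; only the `warnings` calls are real library API.

```python
import warnings

def capped_fold_count(requested: int, available: int) -> int:
    """Hypothetical cap-with-warning helper (not the toolkit's iter_folds)."""
    if requested > available:
        warnings.warn(
            f"requested {requested} folds but only {available} are possible; capping",
            UserWarning,
            stacklevel=2,
        )
        return available
    return requested

# Permissive default: the call succeeds and the warning is recorded.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    permissive = capped_fold_count(10, 4)

# Strict opt-in: promote UserWarning to an exception for this block only.
strict_raised = False
with warnings.catch_warnings():
    warnings.simplefilter("error", UserWarning)
    try:
        capped_fold_count(10, 4)
    except UserWarning:
        strict_raised = True
```

One detail worth keeping in mind: in `warnings.filterwarnings` the second positional parameter is the message pattern, so the category is best passed by keyword (`category=UserWarning`), or via `warnings.simplefilter` as above.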

Partial (1)#

ID	Severity	Finding	Disposition
R9-RG1	accept-with-rationale (Gemini)	`np.inf` sentinel from `recall_at_fpr` risks silent downstream failure.	PARTIAL: internally NO callers (grep across `src/` returned only function def + docstring mentions + an unrelated local var in `losses.py:156`). Sentinel risk is real only at user-code boundary; mitigated by Q1-locked design + docstring’s “caller may filter via np.isinf” guidance. Accept-with-rationale at v1.0.
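
The user-code boundary risk in R9-RG1 can be sketched as follows. The array of per-slice results is made up for illustration, and the sketch assumes the sentinel shows up as an element of a result array; the ledger does not show the actual return shape of `recall_at_fpr`.

```python
import numpy as np

# Hypothetical per-slice results; np.inf stands in for the "undefined" sentinel.
recalls = np.array([0.82, np.inf, 0.74, 0.91])

# Unfiltered aggregation: the sentinel silently poisons the mean.
naive_mean = float(np.mean(recalls))

# Docstring-style guidance: filter via np.isinf before aggregating.
defined = recalls[~np.isinf(recalls)]
filtered_mean = float(np.mean(defined))
```

The unfiltered mean is itself infinite, which is hard to miss in a report; the subtler failures are comparisons and sorts, where an infinite value can look like a legitimate best score.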

Third-audit findings (Claude — modules neither auditor cited) — 3 substantive + 3 minor#

ID	Severity	Finding	Disposition	Commit / Defer-to
R9-F-sweep-1	blocker for v1.0 (Claude escalation)	`_sweep.py:_validate_scorer_output()` accepts NaN/inf scorer outputs without validation; R7-C “no silent failures” invariant violation in a module Codex R7 designed but didn’t audit for finiteness. `stacking.py` rejects non-finite scores; `_sweep.py` didn’t.	Fixed in PR. NaN/inf check added to R7-C boundary; test bundled.	New commit on `release/v0.51.0` (this PR)
R9-F-bootstrap-1	fix-recommended	scipy’s BCa returns degenerate CIs (`ci_low == ci_high == point` or NaN bounds) on small n + ceiling/floor metrics; scipy emits DegenerateDataWarning but doesn’t raise; pre-v0.51 R8-C4(b) RNG bug may have masked this.	Fixed in PR. UserWarning added at the bootstrap_ci boundary when BCa degenerates; test bundled.	New commit on `release/v0.51.0` (this PR)
R9-F-bootstrap-2	minor	`mde_from_ci` width check `if width <= 0` doesn’t catch NaN width (NaN <= 0 is False in IEEE float); silent NaN MDE if BCa returns NaN bounds.	Bundled into the F-bootstrap-1 commit. `or not np.isfinite(width)` guard added.	New commit on `release/v0.51.0` (this PR)
R9-F-metrics-1	minor (Claude)	`metrics.py:1305` Brier-decomposition docstring claims “≈ 1e-9” precision but actual error ~3e-3 dominated by binning. Self-contradictory.	Deferred to v1.0.1 docs polish.	v1.0.1
R9-F-metrics-3	fix-recommended (Claude)	`expected_calibration_error(y=[0,0,1,1], y_score=[0.5,0.5,0.5,0.5], n_bins=10)` returns 0.0 — technically correct per formula but semantically misleading (uninformative model looks perfectly calibrated). Equal-mass binning correctly rejects n<n_bins; equal-width has no guard.	Deferred to v1.0.1; either guard against all-same-score input OR explicitly document.	v1.0.1
R9-F-metrics-4	minor (Claude)	`brier_score` single-class docstring ambiguous about whether single-class is supported; implementation correctly handles it but wording is “implementation detail” not contract statement.	Deferred to v1.0.1 docs polish.	v1.0.1
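
The R9-F-metrics-3 behaviour can be reproduced with a textbook equal-width ECE. This is a minimal sketch under the standard formula, not the toolkit's `expected_calibration_error`; `ece_equal_width` and `is_constant_score` are hypothetical names, and the guard is only one of the two options the disposition mentions.

```python
import numpy as np

def ece_equal_width(y_true, y_score, n_bins: int = 10) -> float:
    """Textbook equal-width ECE: bin-weighted |accuracy - confidence| gap."""
    y_true = np.asarray(y_true, dtype=float)
    y_score = np.asarray(y_score, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only; clip so that a score of exactly 1.0 lands in the last bin.
    bin_ids = np.clip(np.digitize(y_score, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(y_true[mask].mean() - y_score[mask].mean())
            ece += mask.mean() * gap
    return float(ece)

def is_constant_score(y_score) -> bool:
    """Possible guard: flag inputs where every score is identical."""
    scores = np.asarray(y_score, dtype=float)
    return bool(scores.max() == scores.min())

uninformative = ece_equal_width([0, 0, 1, 1], [0.5, 0.5, 0.5, 0.5])
```

All four scores fall into one bin whose mean label and mean score are both 0.5, so the gap is zero even though the model carries no information.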

Round 9 ship status#

6 confirmed source-report findings: 1 fixed in PR (RC1 → 4c43771), 4 deferred to v1.0.1, 2 validation-confirmations (RG4, RG5 — no action).
3 refuted source-report findings (RG2, RG3, RG6): recorded above for audit trail; no shipped fix.
1 partial (RG1): accept-with-rationale at v1.0.
2 third-audit fixes shipped in this PR: F-sweep-1 (CANDIDATE BLOCKER closed); F-bootstrap-1 + F-bootstrap-2 bundled (degeneracy warning + NaN guard).
4 third-audit items deferred to v1.0.1: F-metrics-1, F-metrics-3, F-metrics-4, RC2 (SimilarityStrategy contract resolution).
Round 9 STOP-GATE status: CLOSED with mitigations per the Path 3 lock. v1.0 can tag from this PR post-CI-green.

Multi-LLM audit-machinery calibration notes#

Codex R9 quality: 4/4 findings hold; RC1 over-cited (3 sites named but only 1 had real broken links); no padding finding this round. Gap remains: explicit skip of the 10 untouched modules + 4 cross-cutting hunts. F-sweep-1 (the candidate blocker) lives in precisely that gap. Future-round briefings should make the untouched-modules + hunts mandatory.
Gemini R9 quality: 2/6 validations honest (RG4, RG5); 3/6 REFUTED (RG2, RG3, RG6). RG6 is the new V1/V2 pattern — uncited positive validation on tests Gemini didn’t read; directly contradicted by Codex’s RC3 + my verification. RG2 + RG3 escalate locked design without engaging with the recorded rationale.
Trust calibration: Codex’s probe-backed findings → high trust; Codex’s untouched-module gaps → fill with Claude third-audit. Gemini’s validations → unreliable; Gemini’s design dissent → “raises the question” not “v1.0 blocker.”

Round 10 (2026-05-25) — v1.0 pre-tag micro-audit#

Scoped Codex + Gemini micro-audit on the edadddc R9 follow-on commit only (full ~/Claude/audit-templates/audit-prompt.md template with a scope-override preamble constraining attention to ~126 LOC src + ~110 LOC tests in _sweep.py + bootstrap.py). Dispatched per the Round 10 locking decision: targeted micro-audit, not full multi-LLM gate re-run. Codex returned a 311-line report with 3 substantive findings + per-axis verdicts + 4-test probe verification (all 4 passed). Gemini returned an 86-line report with 2 findings (1 self-recommended-status-quo + 1 fix-recommended on metrics.py finiteness — direct contradiction with Codex’s “Completeness” verdict). Claude verification re-read every cited line. Reports archived locally at codex-microaudit-edadddc-report.md + gemini-microaudit-edadddc-report.md (gitignored).

Confirmed (3) — disposition mapped#

ID	Severity	Source	Finding	Disposition	Commit
R10-F1	fix-recommended	Codex	`_sweep.py:298-299` error message says “finite floats in [0, 1]” but check is `np.isfinite` only — no range enforcement. Cross-ref `protocols.py:29-51` Scorer Protocol docstring also lacked explicit `[0, 1]` contract statement.	Fixed in PR (Codex Option C): extend Protocol docstring to document `[0, 1]` calibrated-probability semantics + reword sweep runtime message to drop `[0, 1]` (boundary still doesn’t enforce range; enforcement deferred to a future minor once consumer usage patterns clarify).	R10 follow-on commit on `release/v0.51.0`
R10-F2	fix-recommended	Codex	`tests/test_bootstrap_unit.py:337-343` BCa degeneracy test uses `if ci.ci_low == ci.ci_high == ci.point_estimate:` — silently no-ops when scipy returns NaN bounds (NaN==NaN is False in IEEE float). Test passes WITHOUT proving the warning fires. Codex probe-verified the no-op path on the current scipy fixture.	Fixed in PR (Codex Option A): mirror the production predicate — `(not np.isfinite(low)) or (not np.isfinite(high)) or (low == high == point)`. Assertion block runs whenever ANY degeneracy mode fires.	R10 follow-on commit on `release/v0.51.0`
R10-F3	minor-observation	Codex	`bootstrap.py:1099-1105` Raises docstring says “non-positive width” but `:1132-1140` implementation rejects on `width <= 0 or not np.isfinite(width)`. Code stricter than docs.	Fixed in PR (Codex Option A + brief Option B context): update Raises text to “non-positive or non-finite width” + 4-line note explaining scipy BCa NaN-bound motivation.	R10 follow-on commit on `release/v0.51.0`
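
The NaN pitfalls behind R10-F2 and R10-F3 (and R9-F-bootstrap-2 before them) come down to IEEE comparison semantics. The sketch below mirrors the predicates quoted in the table, using `math.isfinite` in place of `np.isfinite` to stay stdlib-only; the function names are hypothetical.

```python
import math

def is_degenerate(low: float, high: float, point: float) -> bool:
    """Predicate quoted in R10-F2: non-finite bounds or a fully collapsed interval."""
    return (not math.isfinite(low)) or (not math.isfinite(high)) or (low == high == point)

def naive_is_degenerate(low: float, high: float, point: float) -> bool:
    """Equality-only check: misses NaN bounds because NaN == NaN is False."""
    return low == high == point

def width_is_invalid(width: float) -> bool:
    """Guard from R9-F-bootstrap-2: `width <= 0` alone lets NaN through."""
    return width <= 0 or not math.isfinite(width)

nan = float("nan")
```

A test that branches on the naive predicate skips its assertions whenever the bounds are NaN, which is the silent no-op path Codex probe-verified.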

Partial (1) — accept-as-design#

ID	Severity	Source	Finding	Disposition
R10-RG1	minor-observation	Gemini	`bootstrap.py:376-386` BCa degeneracy check uses output proxy (`ci_low == ci_high == point` + non-finite bounds) rather than catching scipy’s internal `DegenerateDataWarning`. May miss edge-case degeneracies where jackknife acceleration is undefined but bounds don’t exactly collapse.	Accept-as-design. Gemini’s own recommendation was Option B (status quo): “Deterministic, fast, and covers the primary ‘small n + ceiling metric’ failure mode perfectly.” Output-proxy approach avoids `warnings.catch_warnings` overhead on a hot path. Catching internal `DegenerateDataWarning` is a potential v1.1 enhancement if real-world degeneracy modes surface that the output proxy misses.

Refuted (1) — Pattern-1 calibration record#

ID	Severity (as Gemini marked it)	Source	Finding	Rationale
R10-RG2	fix-recommended	Gemini	“`metrics.py` functions like `pr_auc` and `roc_auc` lack a similar `np.isfinite` boundary check.”	REFUTED. `metrics.py:_validate_inputs` (line 1846) explicitly contains `if not np.isfinite(y_score_arr).all(): raise ValueError("y_score contains NaN or inf")`. Called at 20+ public-function sites (lines 366, 441, 479, 552, 650, 802, 872, 942, 1036, 1157, 1261, 1337, 1429, plus more). The check predates the audit chain (the comment “harmonizes with `score_distribution_summary`’s own guard” suggests pre-v0.46 vintage). Gemini did not read the cited code before making the claim — textbook Pattern-1 (validation-without-reading) violation, despite the micro-audit prompt’s `Calibrated-confidence discipline` section explicitly calling out this exact pattern. The calibration record is preserved here.

Multi-LLM audit-machinery calibration (R10)#

Codex R10 quality: 3/3 substantive findings hold; probe-backed; cited file:line on every observation; honest “What I didn’t look at” calibration. Same depth and rigor as R8 + R9.
Gemini R10 quality: 1/2 findings honest (RG1, with self-recommendation = status quo, effectively an open question framed as a finding); 1/2 REFUTED (RG2, Pattern-1 violation). The prompt’s explicit Pattern-1 discipline section did NOT change the outcome. Pattern-1 is now confirmed across R8 V1+V2, R9 RG6, and R10 RG2: four instances across three consecutive rounds. Conclusion: Gemini’s training-trace defaults toward positive validation without reading; explicit prompt-level discipline is necessary but not sufficient to correct it. Practical implication: future audits should treat Gemini validations as “raises the question” and require Claude (or Codex) read-back before accepting any positive validation.
Round 10 STOP-GATE status: CLOSED with mitigations — 3 Codex confirmed findings fixed in this RC; 1 Gemini accept-as-design; 1 Gemini refuted. v1.0 can tag from the R10 follow-on commit post-CI-green.

Round 11 (2026-05-26) — consumer-feedback-driven validator adoption#

Not a multi-LLM gate-style audit. Round 11 is the first consumer-feedback-driven round: consumer-side audit findings → upstream validator shipped → consumer adopts. Multi-LLM cross-review (Rounds 5–10) was the v1.0 stability-contract activation discipline; Round 11 is the v1.x ongoing consumer-feedback loop. Future Round 12+ would resume multi-LLM dispatch only if v2.0 work or a major audit cycle re-opens. This is flagged so readers don’t confuse R11 with the heavyweight gate-style rounds.

Motivating findings (consumer audit origin)#

V1.3.1 ADR-080 audit-fix (2026-05-22, consumer prompt-injection-detection-prototype): WRITEUP_NARRATIVE.md:38 said “TF-IDF + LR reaches 0.974 AUPRC” but canonical TF-IDF direct val AUPRC = 0.971 (0.974 is LoRA’s value). The existing audit_numbers.py validates VALUES against source data but not BINDINGS — both 0.971 AND 0.974 exist in the expected set; the bug is the wrong pairing.
V1.3.2 P1-2 Part 2 (2026-05-23, consumer same repo): docs/REPRODUCIBILITY.md:76 cited “(locked at Phase 0-07 via ADR-029)”, but ADR-029 is the test-marker-strategy ADR, not the reproducibility tier-lock ADR (which is ADR-034). The citation is off by five ADR numbers; lychee resolves it (ADR-029 exists) and audit_numbers.py validates values, but neither catches the semantic mis-citation.
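
The values-versus-bindings distinction in the first finding can be sketched in a few lines. The two AUPRC values and the quoted sentence come from the finding above; `CANONICAL`, `value_check`, and `binding_check` are hypothetical and far simpler than the shipped validators.

```python
import re

# Canonical results from the finding, keyed by (detector, metric).
CANONICAL = {("TF-IDF + LR", "AUPRC"): 0.971, ("LoRA", "AUPRC"): 0.974}

sentence = "TF-IDF + LR reaches 0.974 AUPRC"

def value_check(text: str) -> bool:
    """Value-level audit: every number in the text exists somewhere in the expected set."""
    expected = {f"{v:.3f}" for v in CANONICAL.values()}
    return all(num in expected for num in re.findall(r"\d\.\d{3}", text))

def binding_check(text: str) -> list[str]:
    """Binding-level audit: a number cited with a detector must be that detector's value."""
    problems = []
    for (detector, metric), value in CANONICAL.items():
        if detector in text and metric in text:
            for num in re.findall(r"\d\.\d{3}", text):
                if float(num) != value:
                    problems.append(f"{detector} {metric}: found {num}, expected {value:.3f}")
    return problems
```

The value-level check accepts the sentence because 0.974 is a legitimate number in the expected set; only the binding-level check notices that it is paired with the wrong detector.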

Upstream response timeline#

Time (UTC)	Event
2026-05-24	Consumer files #71 (value_bindings), #72 (concept_drift), #73 (citation_alignment) at v1.3.3
2026-05-24 (later)	PR #74 opens with audit subpackage skeleton + `validate_citations`
2026-05-25 17:01Z	v1.0.0 tag (stability contract activates per ADR 0003)
2026-05-25 17:08Z	v1.0 snapshot regen hotfix (`eb94d26`)
2026-05-25 22:51Z	PR #74 merged (`audit_citation_alignment` flat-module per ADR 0001; restructured from initial subpackage shape)
2026-05-25 23:48Z	v1.0.1 ships — `audit_citation_alignment` + RC4 docs polish
2026-05-26 01:02Z	v1.0.2 ships — #76 cleanup batch (RC2 SimilarityStrategy promoted to 10th strict Tier-2; RC3 test row-content hardening; F-metrics docstring polish)
2026-05-26 01:48Z	Consumer PR #10 merged at v1.3.7 — consumer-side `audit_citation_alignment` adoption complete
2026-05-26 02:35Z	v1.0.3 ships — `audit_value_bindings` + closes #71

End-to-end cycle: V1.3.2 P1-2 origin (2026-05-23) → consumer adoption (2026-05-26 01:48Z) = ~3 days. Upstream-side v1.0.0 → v1.0.3 (audit-validator family of 2): ~10h compressed cycle. Consumer’s v1.3.7 PR opened against audit_citation_alignment (the v1.0.1 deliverable); subsequent v1.0.3 audit_value_bindings adoption will happen on consumer’s own cadence.

Audit-validator family status#

Validator	Module	Released	Issue	Status
Citation alignment	`audit_citation_alignment`	v1.0.1 (`067d4b0`)	#73	✅ shipped + consumer-adopted (consumer PR #10 v1.3.7)
Value bindings	`audit_value_bindings`	v1.0.3 (`775f3da`)	#71	✅ shipped; consumer adoption pending
Concept drift (embedding-clustering)	`audit_sister_doc_concept_drift` (proposed)	future	#72	open; requires MiniLM (already vendored for leakage) + clustering design

Pre-tag + post-v1.0.3 dogfood summary#

The validators were dogfooded against eval-toolkit’s own markdown surface to prove they work in production before consumer adoption:

v1.0.1 pre-tag dogfood (validate_citations): 95 files (README + docs/source/**/*.md ex-ADRs ex-Sphinx-generated); 0 misalignments. Helper _build_adr_subjects_from_repo() exercised.
Post-v1.0.3 dogfood (re-run + synthetic validate_reader_value_bindings):
- Part 1 (citation_alignment expanded): 96 files; 0 misalignments (sanity preserved after audit_value_bindings.md + R10 + R11 additions to docs/source).
- Part 2 (value_bindings synthetic): bindings derived from Round 8 RC4-reconciled tally (("Round 8", "confirmed_count") -> 13 etc). Validator processed audit_findings.md + migration/v0.51.md + CHANGELOG.md without crashing; surfaced 216 pattern-matches + 82 matches. The high false-positive count is expected — the synthetic bindings dict has very loose detector (“Round 8”) + metric (numeric-count) patterns, and any number near a “Round 8” mention triggers. NOT real bugs in eval-toolkit’s docs — just the validator successfully exercising the regex+window pipeline at production-scale markdown.

The dogfood script is preserved at .scratch/dogfood_v1_0_x.py (gitignored) for future re-runs.

Round 11 outcome#

✅ Action 1 (consumer adoption): consumer PR #10 merged 2026-05-26 01:48Z. Consumer-side dogfood + adoption complete for validate_citations.
✅ Action 2 (this ledger entry): R11 section recorded.
✅ Action 3 (dogfood expansion): validators exercised on eval-toolkit’s own surface; no upstream-side findings.

GH issue #77 closed with state_reason=completed.

Multi-LLM audit cadence after R11#

The Gate 3 audit chain (R5–R10) closed at v1.0 per ADR 0003. R11+ is the post-v1.0 consumer-feedback loop and uses NO multi-LLM cross-review by default. Multi-LLM dispatch resumes only if:

A v2.0 design cycle opens (would require a Gate 3-equivalent re-audit per ADR 0003 §“v2.0 re-audit triggers”).
A consumer files a major-severity finding that single-LLM verification (Claude only) leaves uncertain, in which case Codex + Gemini dispatch via the audit-prompt.md template re-engages for that specific question.

Otherwise R11+ is light-touch documentation of the consumer-feedback → upstream-validator → consumer-adopt cycle.

Round 12 (2026-05-26) — schema-extensibility + scope-correctness lesson from #80#

Not a multi-LLM gate-style audit. Round 12 is the second consumer-feedback-driven round, immediately following R11. The R11 audit-validator family (3/3 shipped) revealed a deeper structural gap in audit_value_bindings that the v1.0.3 implementation could not address: the 2-tuple (detector, metric) canonical-identity schema collapsed across slices, producing ~95 false positives on the consumer’s writeup. The fix (v1.1.0) is bigger than the issue’s literal ask — it surfaces two architectural rules that now govern all audit validators (pending ADR 0005).

Motivating finding (consumer audit origin)#

Consumer’s v1.3.9 run (2026-05-26, prompt-injection-detection-submission): scripts/audit_value_bindings.py against own writeup produced 96 warnings, ~95 false positives. The one real bug (WRITEUP_PAPER.md:545 mis-citing 0.971 inside a misleading paragraph context) was correctly flagged but hid in the noise. Consumer caught it via parallel manual audit, not via this validator. Filed #80 at 2026-05-26T18:15:45Z proposing a 3-tuple (detector, metric, slice) schema extension.

Upstream response timeline#

Time (UTC)	Event
2026-05-26T18:15Z	Consumer files #80 (BINDINGS slice-axis schema)
2026-05-26T18:30Z	`/exploring-options` Round 1: P0 path for v1.0.4 PyPI gap. Decision: refined bundle (rerun + workflow_dispatch + verify-receipt).
2026-05-26T19:00Z	`/exploring-options` Round 2: #80 migration shape. Decision: Option 4 — `BindingKey` structured key + multi-shape adapter (rejected Option 1 `(d, m, s)` in-place tuple because positional tuples lock in the recur-every-N-months schema-event pattern).
2026-05-26T19:24Z	v1.0.5 ships — infrastructure-only release (publish workflow hardening; dress rehearsal for the new Verify PyPI receipt step).
2026-05-26T19:30Z	`/exploring-options` Round 3: Step 1 release strategy + verify specs. Decision: v1.0.5 as dress rehearsal; conservative 12×30s PyPI poll.
2026-05-26T19:45Z	`/exploring-options` Round 4: ADR 0005 scope + deprecation policy. Decision: medium scope (audit validators only); indefinite acceptance of legacy 2-tuple BINDINGS through v1.x.
2026-05-26T20:00Z	First dogfood of v1.1.0 slice-axis fix: only 22% noise reduction (95 → 74). User pushed back: “what is the right long-term solution?” Triggered the deeper rethink.
2026-05-26T20:10Z	Identified the second architectural layer: scope correctness. ~80% of v1.0.5 residual noise was content-type confusion (CI brackets, table cells, code blocks), not slice-axis confusion. Added `scope='narrative'` filter.
2026-05-26T20:11Z	v1.1.0 ships — `BindingKey` + slice-aware matching + `scope='narrative'` (closes #80).

End-to-end cycle: #80 filing (18:15Z) → v1.1.0 ship (20:11Z) = ~2 hours compressed cycle. Four /exploring-options rounds + one user-driven architectural redirect (“ultrathink on what the right long-term solution is”) produced a coherent two-layer fix instead of the issue’s narrower literal ask.

The two-layer correctness lesson (pending ADR 0005)#

Audit validators have two correctness layers, not one:

Identity correctness — canonical measurements have structured identity (frozen dataclass with named fields), not positional tuples. Future identity axes (split, ci_kind, source_ref, …) added as defaulted fields without breaking the dict-key schema. Positional tuples lock the validator into a recur-every-N-months schema-event pattern.
Scope correctness — the validator should only scan content plausibly a binding claim. Narrative prose sentences are. Markdown table cells aren’t (they’re structured data, audited differently). Bracketed expressions ([CI 0.286, 0.301]) aren’t (inline expressions, not point estimates). Code blocks aren’t.

This mirrors lint-design conventions from ruff/mypy/bandit (scope predicates like # noqa/# nosec/# type: ignore are first-class, not optional).
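
The identity layer can be sketched as a frozen dataclass plus a small adapter for legacy tuple keys. The field names follow the ledger (`detector`, `metric`, `slice`); the defaults, the `normalize_key` helper, and the `"ood"` slice label are assumptions made for illustration, not the shipped `BindingKey` API.

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass(frozen=True)
class BindingKey:
    """Sketch of a structured key; frozen so it stays hashable as a dict key."""
    detector: str
    metric: str
    slice: Optional[str] = None  # future axes could be added as further defaulted fields

LegacyKey = Union[tuple[str, str], tuple[str, str, str]]

def normalize_key(key: Union[BindingKey, LegacyKey]) -> BindingKey:
    """Hypothetical multi-shape adapter: accept 2-tuples and 3-tuples alongside BindingKey."""
    if isinstance(key, BindingKey):
        return key
    if len(key) == 2:
        return BindingKey(detector=key[0], metric=key[1])
    if len(key) == 3:
        return BindingKey(detector=key[0], metric=key[1], slice=key[2])
    raise ValueError(f"unsupported binding key shape: {key!r}")

bindings = {
    normalize_key(("TF-IDF + LR", "AUPRC")): 0.971,
    normalize_key(("TF-IDF + LR", "AUPRC", "ood")): 0.291,  # illustrative slice label
}
```

Because new axes arrive as defaulted fields, existing keys keep comparing equal and hashing the same way, which is what avoids a schema event each time an axis is added.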

Dogfood evidence#

Configuration	Warnings on consumer HEAD	Reduction vs v1.0.5 baseline
v1.0.5 (legacy 2-tuple)	95	— (baseline)
v1.1.0 BindingKey + slice-fix only (`scope='all'`, default `slice_window_chars=120`)	74	-22%
v1.1.0 BindingKey + slice-fix + `scope='narrative'` (recommended)	23	-76%

The first instinct — “just add a slice axis like the issue asks” — would have left the validator practically unusable on dense academic prose, requiring a separate v1.2.0 cleanup release. The deeper rethink (triggered by the user’s redirect) delivered both layers in a single coherent v1.1.0.

Residual 23 warnings are pre-existing positional-heuristic limitations (sentence-boundary unawareness, multi-detector list parsing on dense prose) not addressable without parser-level work; deferred to future v1.2.0+ as a narrower issue class.
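
The scope layer can be sketched as a pre-filter that keeps only narrative prose. This is a simplified illustration of the idea behind `scope='narrative'`, assuming pipe-delimited Markdown tables and backtick fences; `narrative_lines` is a hypothetical helper, not the toolkit's implementation.

```python
import re

FENCE = "`" * 3  # built at runtime so this example contains no literal fence marker

def narrative_lines(markdown: str) -> list[str]:
    """Keep prose lines only: skip fenced code and table rows, drop bracketed spans."""
    kept, in_fence = [], False
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence or not stripped or stripped.startswith("|"):
            continue
        # Bracketed expressions such as "[CI 0.286, 0.301]" are not point estimates.
        kept.append(re.sub(r"\[[^\]]*\]", "", stripped))
    return kept

sample = "\n".join([
    "The probe scores 0.293 AUPRC [CI 0.286, 0.301] on the held-out slice.",
    "| detector | AUPRC |",
    "| probe | 0.364 |",
    FENCE + "python",
    "threshold = 0.374",
    FENCE,
])
prose = narrative_lines(sample)
```

Only the first line survives, with its confidence-interval bracket removed, so a downstream binding check sees one candidate value instead of five.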

Round 12 outcome#

✅ v1.1.0 shipped (commit 911565b, PyPI live).
✅ #80 auto-closed by closes #80 directive on the v1.1.0 tag.
✅ 0 open issues on the eval-toolkit repo post-#80.
✅ Architecture lesson codified in pending ADR 0005: the identity + scope two-layer rule applies to future audit validators.
✅ Memory entry: feedback_validator_identity_plus_scope captures the “issue body says X is the problem; investigate the dominant noise source before agreeing” discipline.

Consumer adoption + HARD-gate viability#

Consumer (prompt-injection-detection-submission) currently pins eval-toolkit>=1.0.3,<2. Adoption path for v1.1.0:

Smallest diff: replace 2-tuple BINDINGS literal with 3-tuple keys (issue body’s proposal; works directly).
Recommended: migrate to BindingKey(detector=..., metric=..., slice=...) for forward-extensibility.
Add scope="narrative" to the validator call.
Re-run; expect ~76% noise reduction.

HARD-gate promotion at consumer’s v1.3.10+ becomes credible at the ~80% reduction level. Remaining residual false positives can be suppressed via consumer-side filtering (excluding lines containing “random floor” or “versus”) or accepted as known low-frequency noise pending v1.2.0+ pairing-rule improvements.

Multi-LLM audit cadence after R12#

Same as R11: no multi-LLM cross-review for consumer-feedback rounds; reserved for v2.0 design cycles or major-severity findings that single-LLM verification cannot resolve. The R12 fix was informed by the user-driven /exploring-options discipline (4 rounds + a structural redirect), which substitutes for multi-LLM dispatch for this class of architectural decision.

Round 13 (2026-05-26) — `audit_value_bindings` v1.2.0 context-aware noise reduction#

Not a multi-LLM gate-style audit. Third consumer-feedback-driven round, immediately following R12. R12 closed the slice-axis schema-extensibility gap (BindingKey + `scope='narrative'` content-type filter) and reduced the consumer’s noise floor from 96 to 36 warnings (62%). R13 addresses the residual 36 (the positional-heuristic limitations ADR 0005 named as “Future work (deferred)”) via four context-aware extensions to `scope='narrative'` shipped as v1.2.0.

Motivating evidence (consumer-side, post-v1.3.11)#

Consumer (prompt-injection-detection-submission@v1.3.11) ran the v1.1.0 audit_value_bindings against their writeup and produced 36 residual false positives. Consumer-side categorization in decisions/upstream_issues.md:92:

Category	Count	Discriminating prose pattern
Random-floor / sub-clause mentions	~10	“random floor of 0.374”, “baseline of 0.374”
Delta with sign / keyword	~9	“-0.071 AUPRC”, “drops -0.132 below”
Multi-detector “vs”/“versus” enumeration	~8	“AUPRC 0.556 vs 0.519” (same binding double-flagged)
Sentence-boundary cross-pairing	~5	“X scored 0.291. The floor is 0.374”
Genuinely ambiguous (cross-detector list constructions)	~4	“0.293 versus 0.364 for the frozen probe and 0.291 for TF-IDF”

The first four categories share a structural property: they’re narrative-prose context cues that the v1.1.0 positional heuristic can’t read. ADR 0005’s “Future work (deferred)” section named them explicitly. The fifth category (cross-detector list grammar) is a deeper parser-level problem.
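
The first two categories (floor mentions and signed deltas) lend themselves to a keyword-window filter. The sketch below is an illustration only: the keyword sets and the 20-character window are assumptions, since the ledger does not list the contents of `_DELTA_KEYWORDS` or `_FLOOR_KEYWORDS`.

```python
import re

# Illustrative keyword sets (assumed, not the toolkit's actual constants).
FLOOR_KEYWORDS = frozenset({"floor", "baseline"})
DELTA_KEYWORDS = frozenset({"drops", "delta", "below", "above"})

NUMBER = re.compile(r"(?<![\w.])[+-]?\d+\.\d+")

def candidate_values(sentence: str, window: int = 20) -> list[str]:
    """Return numbers that still look like point estimates after context filtering."""
    kept = []
    for match in NUMBER.finditer(sentence):
        token = match.group()
        if token[0] in "+-":
            continue  # signed value: treated as a delta, not a point estimate
        before = sentence[max(0, match.start() - window):match.start()].lower()
        words = set(re.findall(r"[a-z]+", before))
        if words & FLOOR_KEYWORDS or words & DELTA_KEYWORDS:
            continue  # a floor or delta cue appears shortly before the number
        kept.append(token)
    return kept
```

On the table's own examples, "random floor of 0.374" and "drops -0.132 below" yield no candidates, while a plain "X scored 0.291" keeps its value.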

Upstream design — `/exploring-options` 4 rounds + ultrathink redirect#

Time (UTC)	Event
~20:30Z	`/exploring-options` Round 1: which categories to address in v1.2.0? Decision: Full T1+T2+T3+T4 over the Explore agent’s recommended T1+T2-only. T3 and T4 are cheap (~70 LOC combined) and address ADR 0005-named deferred work.
~20:45Z	`/exploring-options` Round 2: API placement. Decision: all four filters bundled under `scope='narrative'`; backward compat for `scope='all'` preserved exactly.
~21:00Z	`/exploring-options` Round 3: sentence-boundary detection rigor. Decision: paragraph-aware abbreviation guard (vs./e.g./i.e./etc./cf./fig./eq./pp./viz./ca. excluded; single `\n` soft, `\n\n` hard; decimal + letter-dot-letter patterns guarded). Rejected naive `α`/`β` (over-splits multi-line academic prose); rejected `γ` library-based (nltk/spacy too heavy).
~21:10Z	`/exploring-options` Round 4: keyword list configurability. Decision: hardcoded module-level `frozenset` constants (`_DELTA_KEYWORDS`, `_FLOOR_KEYWORDS`). No new public kwargs.
~21:20Z	v1.2.0 ships — commit `40b3741`, PyPI live, GH release published, `Verify PyPI receipt` step ✓. End-to-end from R12 closure to v1.2.0 ship: ~1 hour.
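
The Round 3 decision (abbreviation guard, soft single newline, hard blank line, decimal and letter-dot-letter guards) can be sketched as below. The abbreviation list is the one recorded in the table; `sentence_boundaries` and `crosses_boundary` are hypothetical names and the logic is a simplified reading of the decision, not the shipped `_crosses_sentence_boundary`.

```python
import re

ABBREVIATIONS = frozenset(
    {"vs.", "e.g.", "i.e.", "etc.", "cf.", "fig.", "eq.", "pp.", "viz.", "ca."}
)

def sentence_boundaries(text: str) -> list[int]:
    """Return indices just past each hard boundary (sentence-ending period or blank line)."""
    boundaries = []
    for match in re.finditer(r"\n\n|\.", text):
        if match.group() == "\n\n":
            boundaries.append(match.end())  # paragraph break: always hard
            continue
        i = match.start()
        prev_ch = text[i - 1] if i > 0 else ""
        next_ch = text[i + 1] if i + 1 < len(text) else ""
        if prev_ch.isdigit() and next_ch.isdigit():
            continue  # decimal point, as in 0.291
        if next_ch.isalpha():
            continue  # letter-dot-letter, as inside "e.g."
        token = text[: i + 1].split()[-1].lstrip("([\"'").lower()
        if token in ABBREVIATIONS:
            continue  # trailing dot of a known abbreviation
        boundaries.append(match.end())
    return boundaries

def crosses_boundary(text: str, a: int, b: int) -> bool:
    """True if a hard boundary lies between positions a and b."""
    lo, hi = sorted((a, b))
    return any(lo < pos <= hi for pos in sentence_boundaries(text))
```

A single newline never matches the boundary pattern, so multi-line academic prose stays in one sentence, which is the over-splitting the rejected naive options would have caused.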

Dogfood evidence (compounded across the cycle)#

Release	Configuration	Warnings on consumer HEAD	Reduction vs v1.0.5
v1.0.5	Legacy 2-tuple, no scope filter	95	— (baseline)
v1.1.0	`BindingKey` + `scope='narrative'` content-type filter	23	-76%
v1.2.0	+ T1–T4 context filters	7	-93%

The 7 v1.2.0 residuals are all in the “cross-detector list construction” category — prose where the validator can’t infer that subsequent values belong to OTHER detectors via list connectives (“and”, “for X”, “vs”). T3 only deduplicates the SAME binding within one sentence; cross-detector list inference is a parser-level problem.

Round 13 outcome#

✅ v1.2.0 shipped (commit 40b3741, PyPI live).
✅ ADR 0005 amended: “Future work (deferred)” section notes the v1.2.0 partial closure; sentence-boundary + 3 related context-aware filters resolved; cross-detector list grammar remains v1.3.0+ territory.
✅ Consumer adoption path clear: re-pin eval-toolkit>=1.2.0,<2; HARD-gate promotion of audit_value_bindings becomes credible (7 residual < 10 actionable threshold).
✅ Memory entry updated: feedback_validator_identity_plus_scope extended with the v1.2.0 sequel — the same “identity + scope” framing accommodates the new context-aware filters as scope extensions, vindicating the original two-layer architecture decision.

Multi-LLM audit cadence after R13#

Unchanged from R11/R12: no multi-LLM cross-review for consumer-feedback rounds. The R13 design was informed by the user-driven /exploring-options discipline (4 rounds + tactical calibration during implementation), substituting for multi-LLM dispatch for this class of incremental refinement. Future Round 14+ would re-engage multi-LLM dispatch only if a v2.0 design cycle opens or a major-severity consumer finding requires it.

Cross-references#

ADR 0005 — “Future work (deferred)” section now reflects v1.2.0 closure status.
v1.2.0 CHANGELOG entry — full per-filter detail (T1–T4 keyword lists, window calibration, dogfood numbers).
tests/test_audit_value_bindings.py — 36 tests (28 from v1.1.0 + 8 new for T1–T4 + sentence-boundary unit test).

Round 14 (2026-05-26) — Layer 3 pairing rules close cross-detector list-grammar (#81)#

Not a multi-LLM gate-style audit. Fourth consumer-feedback-driven round in the post-v1.0 cycle (R11 → R12 → R13 → R14). The v1.2.0 cycle (Round 13) reduced consumer noise from 96 to 4 warnings via the T1–T4 context-aware narrative filters; the residual 4 were all cross-detector list-grammar / metric-axis confusion cases that the two-layer identity + scope model couldn’t resolve. Consumer filed #81 documenting the 3 prose patterns + proposing a _LIST_CONNECTIVES extension; upstream v1.3.0 ships Layer 3 (pairing rules) per ADR 0006 as the architectural closure.

Motivating residuals from v1.2.0 dogfood#

Consumer’s v1.3.12 dogfood (eval-toolkit v1.2.0 adopted; 36 warnings → narrowed by their additional SKIP_PATTERNS to 4):

File:Line	Pattern category	Prose snippet
`WRITEUP_PAPER.md:304`	“for X” postfix (Pattern A)	`"versus 0.364 [...] for the frozen probe and 0.291 [...] for TF-IDF + LR"`
`RESULTS.md:171` (×2)	possessive + metric confusion (Patterns B + D)	`"LoRA's pooled OOD AUROC is 0.383 against frozen probe's 0.515"`
`README.md:71`	group subject (Pattern C)	`"0.38 AUROC, ~0.6 drop for the trained detectors"`

Upstream design — `/exploring-options` 2 rounds + Pattern D emerged during dogfood#

Time (UTC)	Event
~21:30Z	Consumer files #81 documenting the 3 prose patterns (A/B/C) + proposing `_LIST_CONNECTIVES` design.
~22:00Z	`/exploring-options` Round 1: scope. Decision: A + B + C-suppress (not C-inference) — closes all 4 residuals at ~130 LOC; rejected A+B-only (leaves 1) and A+B+C-inference (~250 LOC, MODERATE-HIGH risk).
~22:10Z	`/exploring-options` Round 2: documentation. Decision: new ADR 0006 codifying Layer 3 (pairing rules) as the third correctness layer alongside ADR 0005’s identity + scope.
~22:30Z	Implementation reaches first dogfood: 4 → 2 warnings. Pattern D emerges: the residual 2 are metric-axis confusion (`"AUPRC delta suggests: AUROC is 0.383"` — wrong metric picked up by proximity). Added as fourth rule (symmetric to detector-axis pairing).
~22:45Z	Three tactical calibrations: (a) Pattern A intervening-value check now uses v1.1.0’s `excluded_ranges` (CI brackets don’t count as intervening); (b) Pattern B refactored to “last possessive within 30 chars” instead of requiring an “is” verb; (c) Pattern C uses v1.2.0’s `_crosses_sentence_boundary` to prevent paragraph-crossing suppression bleed.
~22:55Z	Override-bypass architecture: Pattern A/B now set `pairing_confirmed_pos` to BYPASS the proximity-based detector pairing when they confirm THIS binding’s detector (fixed the bug where override + proximity disagreed).
~23:00Z	v1.3.0 ships — commit `4c2742e`. Consumer dogfood: 4 → 0 warnings. Combined v1.0.5 → v1.3.0: 95 → 0 (-100%).

End-to-end from #81 filing to v1.3.0 ship: ~1.5 hours.

Layer 3 architecture (per ADR 0006)#

Audit validators now have a three-layer correctness model:

Layer	Dimension	Mechanism	Release
1	Identity	`BindingKey` structured keys	v1.1.0
2	Scope	Content-type + context-keyword filters (`scope='narrative'`)	v1.1.0 + v1.2.0
3	Pairing	Override / suppress proximity-based pairing under grammar cues	v1.3.0

Four Layer 3 rules ship in v1.3.0 (all under scope='narrative', no new public kwargs):

Pattern A: "for {detector}" postfix → override detector pairing (with intervening-value guard via v1.1.0 exclusion-ranges).
Pattern B: "{detector}'s" possessive → override detector pairing (last possessive within 30 chars of value).
Pattern C: "for the {trained|frozen|baseline|all|both|other} detectors" group subject → suppress candidate (with sentence-boundary guard via v1.2.0 sentence-positions).
Pattern D: metric-axis nearest-pairing (symmetric to detector-axis). Pre-collects ALL metric positions across consumer-supplied metric_aliases keys, not just binding-derived metrics.
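
Patterns A and B can be sketched as grammar-cue overrides that run before any proximity fallback. The detector names and sentences come from the residuals table; `paired_detector`, the regexes, and the treatment of intervening values are simplifying assumptions (for instance, this sketch has no CI-bracket exclusion ranges).

```python
import re

DETECTORS = ["frozen probe", "TF-IDF + LR", "LoRA"]  # names taken from the residuals table

def paired_detector(sentence: str, value: str, window: int = 30):
    """Return the detector a grammar cue assigns to `value`, or None if no cue applies."""
    pos = sentence.find(value)
    if pos < 0:
        return None
    before = sentence[max(0, pos - window):pos]
    after = sentence[pos + len(value):]
    # Pattern B sketch: last "{detector}'s" possessive within `window` chars before the value.
    best = None
    for det in DETECTORS:
        idx = before.rfind(det + "'s")
        if idx >= 0 and (best is None or idx > best[0]):
            best = (idx, det)
    if best is not None:
        return best[1]
    # Pattern A sketch: "for [the] {detector}" after the value, with no other number in between.
    for det in DETECTORS:
        m = re.search(r"\bfor (?:the )?" + re.escape(det), after)
        if m and not re.search(r"\d\.\d+", after[: m.start()]):
            return det
    return None

possessive = "LoRA's pooled OOD AUROC is 0.383 against frozen probe's 0.515"
postfix = "0.293 versus 0.364 for the frozen probe and 0.291 for TF-IDF + LR"
```

In the postfix sentence the leading 0.293 gets no override, because other values sit between it and every "for ..." cue; that is the role the intervening-value guard plays.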

Dogfood evidence (compounded across the four rounds)#

Release	Configuration	Warnings on consumer HEAD	Reduction vs v1.0.5
v1.0.5	Legacy 2-tuple, no scope filter	95	— (baseline)
v1.1.0	`BindingKey` + `scope='narrative'` content-type	23	-76%
v1.2.0	+ T1–T4 context filters	7	-93%
v1.3.0	+ Patterns A/B/C/D pairing rules	0	-100%
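
The reduction percentages in the dogfood tables follow directly from the v1.0.5 baseline of 95 warnings. A quick check, using only counts recorded in this ledger (the dictionary labels are informal):

```python
baseline = 95  # v1.0.5 warnings on consumer HEAD

warnings_by_config = {
    "v1.1.0 slice-fix only": 74,
    "v1.1.0 scope=narrative": 23,
    "v1.2.0": 7,
    "v1.3.0": 0,
}

# Percentage reduction relative to the baseline, rounded to whole percent.
reduction_pct = {
    name: round(100 * (baseline - count) / baseline)
    for name, count in warnings_by_config.items()
}
```

The rounded values come out as 22, 76, 93, and 100, matching the table rows.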

Round 14 outcome#

✅ v1.3.0 shipped (commit 4c2742e, PyPI live).
✅ #81 auto-closed by closes #81 directive on the v1.3.0 tag.
✅ 0 open issues on the eval-toolkit repo post-#81.
✅ ADR 0006 committed — Layer 3 pairing rules formally codified as the third correctness layer.
✅ Consumer HARD-gate promotion now credible. With 0 residual warnings, the consumer can promote audit_value_bindings from SOFT to HARD bundled with audit_citation_alignment per their v1.3.8 plan.

The post-v1.0 consumer-feedback cycle (R11 → R14)#

The four post-v1.0 rounds form a complete consumer-feedback loop:

Round	Driver	Cycle time	Closure
R11	Consumer adopts v1.0.x audit-validator family	days	v1.0.4 (3 validators shipped)
R12	Consumer files #80 (BINDINGS slice-axis)	~2 hours	v1.1.0 (`BindingKey` + `scope='narrative'`; 95 → 23)
R13	v1.1.0 dogfood surfaces context-filter gaps	~1 hour	v1.2.0 (T1–T4 context filters; 23 → 7)
R14	Consumer files #81 (cross-detector list-grammar)	~1.5 hours	v1.3.0 (Layer 3 pairing rules; 7 → 0)

The pattern is consistent: consumer surfaces a structural gap → upstream files an ADR-driven design → ships within hours → consumer adopts same-day. The 100% noise reduction over four rounds validates the iterative consumer-feedback model as a substitute for heavyweight multi-LLM gate audits (which closed at v1.0 per ADR 0003).

Multi-LLM audit cadence after R14#

Unchanged from R11/R12/R13: no multi-LLM cross-review for consumer-feedback rounds. R14’s design was informed by /exploring-options (2 rounds) + Pattern D emergence during dogfood + 3 tactical calibrations during implementation. Multi-LLM dispatch remains reserved for v2.0 design cycles or major-severity findings.

Cross-references#

ADR 0006 — Layer 3 codification.
v1.3.0 CHANGELOG entry — full per-pattern detail with consumer adoption path.
tests/test_audit_value_bindings.py — 43 tests (36 from v1.2.0 + 7 for Patterns A/B/C/D + backward-compat + combined dogfood).

Round 15 (2026-05-26) — audit_citation_alignment Layer 2 + Layer 3 + ADR 0007 family-wide architecture#

Not a multi-LLM gate-style audit. Fifth consumer-feedback-driven round, opening the equivalent cycle for audit_citation_alignment that R11→R14 closed for audit_value_bindings. Consumer filed eval-toolkit#82 concurrent with their v1.3.13 adoption: 188 residual warnings on audit_citation_alignment, same architectural-class gap (Layer 2 + Layer 3 context-awareness) that audit_value_bindings worked through across three releases.

Architectural insight: ADR 0007 generalizes the model#

ADR 0005 (Layer 1 + 2) and ADR 0006 (Layer 3) were originally framed as audit_value_bindings-specific. The v1.4.0 cycle adopting the same architecture for audit_citation_alignment revealed that the three-layer model is the canonical architecture for the audit-validator family — not just one validator. ADR 0007 codifies this family-wide.

The v1.4.0 release also extracted shared narrative-prose helpers into private flat module eval_toolkit/_narrative.py (consistent with ADR 0001’s _rng.py/_parallel.py/_sweep.py precedent). Both validators import from this module; future audit validators inherit the same primitives.

Three-layer model applied to audit_citation_alignment#

Pattern	Layer	Mechanism
β	Layer 2	Exclude citations inside markdown table rows, bracketed expressions, fenced code blocks (reuses v1.1.0 `_build_exclusion_ranges`).
γ	Layer 3	Category-keyword extraction window bounded by the SENTENCE containing the citation (reuses v1.2.0 `_sentence_boundary_positions`).
α	Layer 3	Multi-category set membership: when the sentence matches multiple category keywords (multi-topic prose), accept the citation if the ADR’s actual category is in the set.
None-skip	Layer 2 refinement	Defer (skip the citation) when `subject.category is None` — the consumer’s category map can’t classify the ADR. Symmetric to the existing `claim_category is None` skip from v1.0.1.
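
To make the table concrete, here is a minimal sketch of how Pattern γ (sentence-bounded keyword window), Pattern α (set membership), and the symmetric None-skip could combine. Everything in it is an assumption for illustration: the category map, the ADR map, the naive sentence splitter, and the function names are hypothetical, and Pattern β (exclusion ranges) is omitted. For a single-topic sentence the set has one element, so the membership test here reduces to the single-category comparison.

```python
import re

# Hypothetical consumer-supplied maps; names and contents are illustrative only.
CATEGORY_KEYWORDS = {
    "threshold": ("threshold", "cutoff"),
    "cost": ("cost", "budget"),
}
ADR_CATEGORY = {"ADR-025": "cost", "ADR-031": "threshold"}  # ADR-099 is deliberately unmapped


def sentence_span(text: str, pos: int) -> tuple[int, int]:
    """Bounds of the sentence containing pos, using naive '.', '!', '?' terminators."""
    start = 0
    for m in re.finditer(r"[.!?](?:\s+|$)", text):
        if m.end() > pos:
            return start, m.end()
        start = m.end()
    return start, len(text)


def claimed_categories(sentence: str) -> set[str]:
    """Every category whose keywords appear in the sentence (naive substring match)."""
    lowered = sentence.lower()
    return {cat for cat, kws in CATEGORY_KEYWORDS.items() if any(k in lowered for k in kws)}


def check_citation(text: str, adr_id: str) -> str:
    """Return 'ok', 'warning', or 'skipped' for the first citation of adr_id."""
    pos = text.index(adr_id)
    start, end = sentence_span(text, pos)
    claimed = claimed_categories(text[start:end])  # Pattern γ: only the citing sentence
    actual = ADR_CATEGORY.get(adr_id)
    if actual is None or not claimed:              # symmetric None-skip
        return "skipped"
    return "ok" if actual in claimed else "warning"  # Pattern α: set membership


print(check_citation("Recall improved. The threshold and the cost ceiling follow ADR-025.", "ADR-025"))
print(check_citation("The cost model is settled. The threshold follows ADR-025.", "ADR-025"))
print(check_citation("The threshold follows ADR-099.", "ADR-099"))
```

The second call shows why the sentence bound matters: "cost" appears in the preceding sentence, but only the citing sentence contributes keywords, so the threshold claim is still flagged against a cost ADR.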

Upstream design — `/exploring-options` 3 rounds + Pattern α generalization + None-skip emerged during dogfood#

Time (UTC)	Event
~21:30Z	Consumer files #82 documenting 3 prose patterns (α/β/γ) + 188-warning baseline.
~22:00Z	`/exploring-options` Round 1: scope. Decision: bundle Path A + Path B in v1.4.0 (one coherent release; patterns established).
~22:10Z	`/exploring-options` Round 2: helper sharing. Decision: extract to private flat module `_narrative.py`.
~22:20Z	`/exploring-options` Round 3: documentation. Decision: write ADR 0007 generalizing three-layer model to family.
~22:30Z	Implementation reaches first dogfood: 188 → 101 (46% reduction). Pattern β + γ + initial α (multi-citation only) firing.
~22:40Z	Symmetric-None skip emerges: dominant residual is `actual=None` cases (ADRs the consumer’s category map can’t classify). Symmetric to existing `claim=None` skip from v1.0.1. Added under `scope='narrative'` → 101 → 44.
~22:50Z	Pattern α generalized: extend from “multi-citation sentences only” to “any multi-topic sentence” (any sentence matching ≥2 category keywords). 44 → 37.
~22:55Z	Decision: ship at 37 rather than over-engineer. Above the #82 ≤20 target but a 5× reduction; residual 37 includes real misalignments worth consumer-triage + edge cases requiring parser-level understanding.
~23:00Z	v1.4.0 ships — commit `7c40490`. ADR 0007 codifies family-wide architecture.

End-to-end from #82 filing to v1.4.0 ship: ~1.5 hours.

Dogfood evidence (audit_citation_alignment cycle)#

Release	Configuration	Warnings on consumer HEAD	Reduction vs v1.3.0 baseline
v1.3.0	`audit_citation_alignment` with `scope='all'`	188	— (baseline)
v1.4.0	`scope='narrative'`	37	-80%

The residual 37 is above the original #82 acceptance criterion (≤20) but represents a 5× reduction. Categorized:

Real misalignments consumer should triage (e.g., ADR-025 cited for a threshold claim when ADR-025 is the cost ADR — could be a wrong-ADR bug or a multi-topic ADR not captured by the consumer’s category-keyword map).
Single-topic-sentence edge cases: Pattern α only fires for multi-topic sentences (≥2 category keywords matched). Sentences with one dominant topic + a cross-category ADR cite stay on the legacy first-match check.
Multi-topic ADRs where the consumer’s category map assigns a single category but the ADR genuinely covers multiple topics. Consumer-side category-map expansion would address these.

Round 15 outcome#

✅ v1.4.0 shipped (commit 7c40490, PyPI live).
✅ #82 auto-closed by closes #82 directive on the v1.4.0 tag.
✅ 0 open issues on the eval-toolkit repo post-#82.
✅ ADR 0007 committed — three-layer architecture (identity + scope + pairing) codified as canonical for ALL audit_* validators. ADRs 0005/0006 now framed as family-wide via 0007.
✅ _narrative.py extracted — shared narrative-prose helpers; both validators import from one canonical location. Signature-preserving refactor; all v1.3.0 tests pass unchanged.
⚠ Consumer HARD-gate promotion: still a judgment call. The residual 37 includes some real misalignments worth fixing in consumer prose; HARD-gating now would block commits on those lines. Consumer’s v1.3.8 bundled-promotion plan should proceed only after triaging the 37.

The post-v1.0 consumer-feedback cycle (R11 → R15)#

The five rounds form a complete library-first cycle for the audit-validator family:

Round	Driver	Cycle time	Closure
R11	Consumer adopts v1.0.x audit-validator family	days	v1.0.4 (3 validators shipped)
R12	Consumer files #80 (BINDINGS slice-axis)	~2h	v1.1.0 + v1.2.0
R13	v1.1.0 dogfood surfaces context-filter gaps	~1h	v1.2.0
R14	Consumer files #81 (audit_value_bindings list-grammar)	~1.5h	v1.3.0 (Layer 3)
R15 (this)	Consumer files #82 (audit_citation_alignment scope+pairing)	~1.5h	v1.4.0 (Layer 2 + 3 + ADR 0007)

R11→R15 demonstrates the iterative consumer-feedback model as a substitute for heavyweight multi-LLM gate audits (closed at v1.0 per ADR 0003). Each round delivered closure within hours of the consumer filing, with same-day adoption. ADR 0007 captures the architectural template; future audit validators inherit the three-layer model and ship in fewer cycles.

Multi-LLM audit cadence after R15#

Unchanged from R11–R14: no multi-LLM cross-review for consumer-feedback rounds. R15’s design was informed by /exploring-options (3 rounds) + 2 dogfood-driven refinements (Pattern α generalization + symmetric-None skip). Multi-LLM dispatch remains reserved for v2.0 design cycles or major-severity findings.

Cross-references#

ADR 0007 — family-wide three-layer codification.
v1.4.0 CHANGELOG entry — full per-pattern detail with consumer adoption path.
tests/test_audit_citation_alignment.py — 18 tests (12 from v1.0.1 + 6 for v1.4.0 Patterns α/β/γ + backward-compat + shared-helpers + combined dogfood).
