Audit findings ledger

This ledger tracks findings from each cross-model methodology audit (Gate 3 per ADR 0003 — to be drafted at v0.48). Each row records the finding ID, severity, finding summary, disposition (how it was resolved or scheduled), and a link to the tracked issue (where applicable).

Convention: blocker-severity findings get a p1-gate3-labelled GitHub issue for fix-tracking. Lower-severity findings are recorded here only.

Cross-references:

  • The audit prompt template is at gate3-audit-prompt.md (local — not in published docs).

  • The v1.0 plan that drives audit cadence is at ~/.claude/plans/evaluate-all-the-work-twinkly-kite.md (local).

  • Audit re-run schedule: after each breaking minor (v0.46, v0.47, v0.48) plus the original Round 5 pre-implementation pass. 7-day audit-completion timeout per gate.


Round 5 (2026-05-21) — Codex + Gemini pre-implementation audit

Reviewers: author (manual) + Codex (independent report) + Gemini (independent report).

Packet: v0.44.0 code state + the v1.0 release plan (~/.claude/plans/evaluate-all-the-work-twinkly-kite.md) + docs/source/methodology/ (16 chapters) + docs/source/roadmap.md + CHANGELOG.md + src/eval_toolkit/*.py + existing migration guides (migration/v0.7.md, v0.8.md, v0.9.md).

Audit prompt: ~/.claude/plans/gate3-audit-prompt.md.

| ID | Severity | Finding | Disposition | Issue |
|---|---|---|---|---|
| R5-F1 | blocker before v0.46 | scorecard() first-party metric list mixes threshold-free + threshold-dependent metrics; MetricSpec.compute(y, s) -> float has no threshold input | Decision R: drop F1/accuracy/precision/recall from v0.46 specs; keep metrics_at_threshold + ThresholdSelector as separate path | (in plan) |
| R5-F2 | blocker before v0.46 | Scorecard result type has no contract for undefined/skipped/error cells; existing MetricState infra unused | Decision S: reuse MetricState (ok/skipped/error) vocabulary from artifacts.py:30-61 | (in plan) |
| R5-F3 | blocker before v0.47 | Sweep unification plan assumes DelimitVariant/DatamarkVariant classes exist; preprocessing.py exports only functions | Plan revised: create 3 preprocessing dataclasses at v0.47 as part of sweep work; sweep contract clarified (neutral default; explicit attack_threshold required for ASR) | (in plan) |
| R5-F4 | blocker before v0.46 | Plan’s scalar-deprecation directive would replace the lazy export resolver (__init__.py:302-312), shattering all root imports | Plan corrected: extend existing __getattr__ with a transitional deprecation branch (BEGIN/END markers); v0.47 removes only that branch, base resolver intact | (in plan) |
| R5-F5 | blocker before v1.0 | DeLong (DeLongResult, delong_roc_variance) is publicly exported + in API docs, but methodology + roadmap docs say “out of scope” | Decision U: keep public; update methodology/comparison.md, methodology/reading_list.md, roadmap.md to align with shipped state. Bundled into v0.48 §5E-prep | (in plan) |
| R5-F6a | packet drift | cv_clt_ci docstring claims “Bayle et al. 2020 prove a CV-CLT with a correction factor”; code does naive sample variance (no scalar correction) | Docstring fix bundled into v0.48 §5E-prep. Code is correct per Bayle (2020) Thm 3.1; docstring oversells | (in plan) |
| R5-F6b | packet drift | methodology/parallelism.md:143-181 says “as of v0.34, harness not yet parallelized” + “once #29/#30 land” — but v0.36 wired evaluate(n_jobs=) + evaluate_folded(n_jobs=) | Doc fix bundled into v0.48 §5E-prep. Also clarify bootstrap_ci’s n_jobs is studentized-only | (in plan) |
| R5-F6c | packet drift | methodology/testing.md:108-136 says reference-equivalence gap “closing in PR 1.5”; roadmap state shows it’s closed | Doc fix bundled into v0.48 §5E-prep | (in plan) |
| R5-F6d | packet drift | methodology/calibration.md:15-18 lists only temperature/isotonic/Platt; Beta + 4-binary-adapter family also shipped | Doc fix bundled into v0.48 §5E-prep | (in plan) |
| R5-F6e | packet drift + code | methodology/bootstrap.md two-level example uses same array for val + test, causing ~63.2% overlap when paired_bootstrap_op_point_diff resamples val/test independently | Doc fix + defensive code guard bundled into v0.48 §5E-prep: rewrite example with disjoint split + raise ValueError in paired_bootstrap_op_point_diff if val_y is test_y | (in plan) |
| R5-F6f | partial verified | CostSensitiveSelector formula t* = c_FP·(1-π) / (c_FP·(1-π) + c_FN·π) is the prior-corrected form; existing docstring already cites Elkan’s prior-independent variant — intentional design, but easy to misuse on already-calibrated probabilities | Docstring sharpening (add Warning admonition) bundled into v0.48 §5E-prep. Math itself is correct per the documented intent | (in plan) |
| R5-F7 | governance risk | Gate 3 redefinition (multi-model cross-review) is useful but not the same evidence class as external academic peer review; the plan understated this | Decision O revised: ADR 0003 explicitly states Gate 3 at v1.0 is internal model-assisted cross-review, NOT external academic peer review; documents what it catches and doesn’t | (in plan) |
| R5-Px | plan internal inconsistency | v0.46 verification smoke imports TextTransform, but Decision K introduces it at v0.47 | Plan smoke test corrected — TextTransform import removed from v0.46 smoke | (in plan) |
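The threshold formula quoted in R5-F6f can be worked through numerically. The following is a minimal sketch, assuming π is the positive-class prevalence and both costs are positive; the function name is hypothetical and is not the CostSensitiveSelector API.

```python
def prior_corrected_threshold(c_fp: float, c_fn: float, prevalence: float) -> float:
    """t* = c_FP*(1 - pi) / (c_FP*(1 - pi) + c_FN*pi), as quoted in R5-F6f."""
    if c_fp <= 0 or c_fn <= 0:
        raise ValueError("costs must be positive")
    if not 0.0 < prevalence < 1.0:
        raise ValueError("prevalence must lie strictly between 0 and 1")
    weighted_fp = c_fp * (1.0 - prevalence)
    weighted_fn = c_fn * prevalence
    return weighted_fp / (weighted_fp + weighted_fn)


# With a balanced prior the formula reduces to c_FP / (c_FP + c_FN).
balanced = prior_corrected_threshold(c_fp=1.0, c_fn=4.0, prevalence=0.5)
# With the same costs and a rarer positive class the threshold moves up.
rare = prior_corrected_threshold(c_fp=1.0, c_fn=4.0, prevalence=0.1)
```

Because the prevalence enters the formula explicitly, feeding it scores that already encode the class prior would count the prior twice, which is the misuse the finding warns about.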

Round 5 totals: 12 findings (5 blockers, 1 high-pre-v1, 5 packet drift, 1 governance, 1 plan inconsistency). All resolved or scheduled before any v0.46 code shipped.
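The R5-F4 disposition (extend the existing module-level __getattr__ with a delimited deprecation branch rather than replacing the resolver) can be sketched as follows. This is an illustration under stated assumptions: the export table contents, the warning text, and the throwaway module name demo_pkg are hypothetical; only the names _EXPORTS and _DEPRECATED_SCALARS appear in this ledger.

```python
import types
import warnings

# Hypothetical stand-ins: the real tables are not shown in this ledger.
_EXPORTS = {"scorecard": lambda: "scorecard-callable"}
_DEPRECATED_SCALARS = {"expected_calibration_error": lambda: "ece-callable"}


def _module_getattr(name: str):
    # BEGIN transitional deprecation branch (the only part a later release deletes)
    if name in _DEPRECATED_SCALARS:
        warnings.warn(
            f"{name} is deprecated at the package root",
            DeprecationWarning,
            stacklevel=2,
        )
        return _DEPRECATED_SCALARS[name]()
    # END transitional deprecation branch
    if name in _EXPORTS:  # base lazy resolver, left intact
        return _EXPORTS[name]()
    raise AttributeError(f"module 'demo_pkg' has no attribute {name!r}")


# Attach the hook to a throwaway module so the sketch runs without a package on disk.
demo_pkg = types.ModuleType("demo_pkg")
demo_pkg.__getattr__ = _module_getattr

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    resolved_plain = demo_pkg.scorecard                  # resolves silently
    resolved_old = demo_pkg.expected_calibration_error   # resolves with a warning
```

Deleting the lines between the BEGIN and END markers leaves the base resolver untouched, which is the property the finding asked for.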

Key follow-on decisions (driven by Round 5):

  • Decision W: packet-drift fixes (F6a–F6f) bundle into v0.48 polish (not earlier hotfix).

  • Decision X: scorecard skipped-status detection reuses is_metric_defined_for_slice from v0.39.0.

  • Decision Y: Round 6 / 7 / 8 audits run as formal stop-gates between v0.46 / v0.47 / v0.48 / v1.0; 7-day timeout per gate.

  • Decision Z: this ledger file.
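The ~63.2% figure in R5-F6e follows from the chance that a given row lands in a size-n bootstrap resample, 1 - (1 - 1/n)^n, which tends to 1 - 1/e. A minimal sketch of that arithmetic, together with a hypothetical identity guard modelled on the `val_y is test_y` check named in the disposition (the helper names are illustrative, not the library's):

```python
import math


def expected_unique_fraction(n: int) -> float:
    """Probability that a given row appears in one size-n bootstrap resample."""
    return 1.0 - (1.0 - 1.0 / n) ** n


def check_disjoint_inputs(val_y, test_y) -> None:
    """Hypothetical guard: refuse the same object passed as both splits."""
    if val_y is test_y:
        raise ValueError("val_y and test_y are the same object; pass a disjoint split")


frac = expected_unique_fraction(10_000)  # close to the limit below
limit = 1.0 - 1.0 / math.e               # about 0.632
```

An identity check only catches the exact mistake in the original example; two distinct arrays with shared rows would still pass it.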


Round 6 (complete: 2026-05-21 — Codex + Gemini + manual; 11 findings)

Ship date: v0.46.0 tagged + published to PyPI 2026-05-21. STOP-GATE per Decision Y.2 — release/v0.47.0 opens once R6-F1 (BLOCKER) + R6-F2 (HIGH) are fixed via v0.46.1 hotfix and the remaining items are scheduled.

Reviewers: author (manual) + Codex (independent) + Gemini (independent).

Packet to send to Codex + Gemini:

  • This plan file (~/.claude/plans/evaluate-all-the-work-twinkly-kite.md)

  • This audit ledger (docs/source/audit_findings.md)

  • docs/source/methodology/ (16 chapters — unchanged from Round 5)

  • docs/source/roadmap.md

  • docs/source/migration/v0.46.md (NEW — consumer migration guide)

  • docs/source/adr/0002-scorecard-as-primary-metric-surface.md (NEW)

  • CHANGELOG.md (new v0.45.0 + v0.46.0 entries since Round 5)

  • Source tree, with focus on the v0.45 + v0.46 additions:

    • src/eval_toolkit/stacking.py (v0.45, ~370 lines)

    • src/eval_toolkit/scorecards.py (renamed from _scorecard.py at v0.49.0; ~585 lines)

    • src/eval_toolkit/metric_specs.py (v0.46, ~180 lines)

    • src/eval_toolkit/__init__.py - __getattr__ deprecation branch + _DEPRECATED_SCALARS (Audit F4 invariant region)

    • src/eval_toolkit/metrics.py:120-200 - SINGLE_CLASS_INCOMPATIBLE_METRICS extension (Round-5 X.2 precondition)

Audit prompt: ~/.claude/plans/gate3-audit-prompt.md (local). The “Known issues already in the v0.48 backlog (skip re-reporting)” section already lists drift items scheduled for v0.48 polish — Round 6 reviewers should skip those and surface only NEW findings against the v0.46 state.

Focus areas for Round 6 review:

  • scorecard surface design lock-in. The Tier-2 MetricSpec Protocol freezes at v1.0 — method-signature changes require a v2.0 major bump. Last cheap chance to catch contract gaps.

  • MetricResult cell-state contract — does the ok / skipped / error vocabulary cover every relevant failure mode? Are the reason strings useful for triage?

  • Per-cell error isolation — confirm that catching all exceptions in _evaluate_spec doesn’t hide important failures the user should see.

  • __getattr__ deprecation shim — Audit F4 invariant: does the branch correctly route deprecated names, NOT break non-deprecated resolution, and cleanly delete at v0.47?

  • Spec name encoding for parameterized metrics — is "ece_n_bins_15_strategy_uniform" a stable v1.0 commitment, or does the alphabetize-kwargs rule create surprise keys for custom user specs with multi-kwarg signatures?

  • Scorecard.to_pandas() MultiIndex schema — first-time-public; any shape lock-in concerns?
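The spec-name question in the focus areas above comes down to whether keyword order can change the key. A minimal sketch of an order-independent encoding, assuming keys are sorted alphabetically and joined with underscores; this body is a hypothetical illustration and not the helper that shipped.

```python
def make_spec_name(prefix: str, **kwargs: object) -> str:
    """Build a key from the prefix plus alphabetically sorted key/value pairs."""
    parts = [prefix]
    for key in sorted(kwargs):
        parts.append(f"{key}_{kwargs[key]}")
    return "_".join(parts)


# Keyword order at the call site does not affect the resulting key.
name_a = make_spec_name("ece", n_bins=15, strategy="uniform")
name_b = make_spec_name("ece", strategy="uniform", n_bins=15)
```

Both calls yield the key quoted in the focus area, "ece_n_bins_15_strategy_uniform". The sketch does not normalise values (case, spaces, floats), which a real canonicalization rule would also need to pin down.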

Triage on findings: each blocker → p1-gate3-labelled GitHub issue + a row in this ledger. Either fix-as-v0.46.1-hotfix or fold into v0.47 design (per Decision Q severity-tiered hotfix policy).

| ID | Reviewer | Severity | Finding | Disposition | Lands |
|---|---|---|---|---|---|
| R6-F1 | Codex | BLOCKER before v0.47 opens | metric_specs.ece(strategy="typo") silently dispatches to quantile ECE and returns scorecard cell with status="ok" under invalid key ("ece_n_bins_15_strategy_typo"). Wrong-by-design data correctness path. Verified via Codex runtime probe. | Add strategy validation in ece() factory + _EceSpec.compute(); raise ValueError("ECE strategy must be 'uniform' or 'quantile'; got {strategy!r}") (plan §2.5A). | RESOLVED v0.46.1 (commit 7a4bb14, tag v0.46.1 2026-05-21; consumer pin bumped same day) |
| R6-F2 | Codex + Gemini | HIGH before v0.47 scalar hard-removal | ECE deprecation warnings in __init__.py:_scorecard_spec_for() emit broken migration snippets for all 5 ECE variants. Two-part bug: (a) for the 2 variants in metric_specs, the suggested scorecard key uses the factory-call expression ("ece(n_bins=10)") instead of the encoded spec name ("ece_n_bins_10_strategy_uniform"); (b) for the 3 variants NOT in metric_specs (_debiased, _l2, _l2_debiased), the fallback name isn’t an importable spec. Gemini claimed pre-v0.46 default was n_bins=15 (verified incorrect — code at metrics.py:730-734 shows n_bins=10); Decision R6-F resolves: warning uses n_bins=10 to preserve pre-v0.46 math + adds migration note about new factory default. | Restructure _scorecard_spec_for() to return (factory_expr, scorecard_key, has_first_party) tuple; correct snippets for first-party variants with n_bins=10; submodule-path template for 3 non-first-party variants per Decision R6-G (plan §2.5B). | RESOLVED v0.46.1 (commit 7a4bb14, tag v0.46.1 2026-05-21; consumer-side smoke verified all 5 ECE-variant warnings + submodule-path routing) |
| R6-F3 | Codex | HIGH before scorecard freeze | Duplicate MetricSpec.name values in the same scorecard() call silently overwrite earlier cells (last-wins). Not a documented contract. | Decision R6-B (locked): reject in scorecard() with ValueError("Duplicate MetricSpec name 'X' at index N; ..."). Forces caller to disambiguate; no silent data loss. (Plan §4G.) | v0.47 |
| R6-F4 (= Gemini R6-F1) | Codex + Gemini | HIGH before v1.0 | scorecard(seed=None) documented as non-deterministic; implementation coerces None to 0. Doc/impl contradiction. Verified by Codex via bit-for-bit equality test. | Decision R6-A (locked): deterministic-by-default; fix docs only. No behavior change. Plan §4G-prep. (Decision R6-E: rolls to v0.47 — R6-A is non-blocker per Decision Q’s “docstring” category.) | v0.47 |
| R6-F5 | Codex | Contract-enforcement gap before v1.0 | ADR 0003 promises strict Tier-2 Protocol method-shape stability; current public-API drift guard only snapshots (*args, **kwargs) for Protocol classes, not method signatures. The guard does not see changes to MetricSpec.compute, MetaLearner.fit, etc. | Decision R6-D (locked): extend tests/test_public_api.py snapshot to capture Protocol method signatures via inspect.signature + typing.get_type_hints for the 9 Tier-2 Protocols. (Plan §4I.) | v0.47 |
| R6-F6 | Codex | Packet drift | v1.0 plan + roadmap still describe pre-v0.46 scorecard shapes that didn’t ship: ece_n_bins_15 without strategy in plan, ece_quantile() factory listed (shipped as ece(strategy='quantile')), MetricUndefinedError mentioned (ADR 0002 chose no new public exception), n_resamples >= 100 floor (shipped is >= 1). Roadmap “Currently shipped” still says v0.44. | Plan §4L: refresh plan §3A scorecard examples + roadmap shipped-state section. Doc-only commit on v0.47 release branch. | v0.47 |
| R6-F3 (Gemini) | Gemini | MEDIUM (schema lock-in before v1.0) | Scorecard.to_pandas() MultiIndex columns expose value, status, reason, ci_low, ci_high, confidence but drop n_resamples + method from BootstrapCI. Provenance loss compared to to_dict(). v1.0 is about to lock the schema. | Decision R6-C (locked): add n_resamples + method columns at v0.47 (additive). Schema becomes lossless against to_dict(). (Plan §4H.) | v0.47 |
| R6-F4 (Gemini) | Gemini | LOW | MetricSpec Protocol doesn’t enforce stable parameterized-spec naming. Custom users implementing multi-kwarg parameterized specs can silently spawn distinct dict keys if constructor arg order varies. | Decision R6-H (locked): add make_spec_name(prefix, **kwargs) canonicalization helper in metric_specs.__all__ only (NOT top-level _EXPORTS — Tier-2 additive contract). Alphabetized kwargs, snake_cased, joined by underscore. (Plan §4J.) | v0.47 |
| R6-F5 (Gemini) | Gemini | LOW | _evaluate_spec() wraps spec.compute() in broad except Exception. Swallows MemoryError, RecursionError, KeyboardInterrupt, SystemExit into cell state — process exhaustion / user-interrupt signals get hidden as metric errors. | Narrow exception catch: except (MemoryError, RecursionError, KeyboardInterrupt, SystemExit): raise first, then existing broad catch. (Plan §4K.) | v0.47 |

Round 6 totals: 11 findings (Codex 6 + Gemini 5; 2 overlap on seed=None + ECE deprecation snippets but with different reasoning angles). 1 BLOCKER (R6-F1) + 5 HIGH + 2 MEDIUM/contract + 3 LOW. All dispositioned to either v0.46.1 (2 fixes) or v0.47 (9 fixes).
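The per-cell isolation pattern behind Gemini's R6-F5 row (re-raise process-level failures first, then capture everything else as a cell state) can be sketched like this. The function name and the dict-shaped cell are hypothetical simplifications of _evaluate_spec and MetricResult.

```python
def evaluate_spec_sketch(compute, y, s) -> dict:
    """Run one metric; isolate ordinary failures, let critical ones escape."""
    try:
        value = compute(y, s)
    except (MemoryError, RecursionError, KeyboardInterrupt, SystemExit):
        # Resource exhaustion and interrupt signals must reach the caller.
        raise
    except Exception as exc:
        return {"status": "error", "value": None, "reason": f"{type(exc).__name__}: {exc}"}
    return {"status": "ok", "value": float(value), "reason": None}


def mean_score(y, s):
    return sum(s) / len(s)


ok_cell = evaluate_spec_sketch(mean_score, [0, 1], [0.2, 0.8])
error_cell = evaluate_spec_sketch(mean_score, [], [])  # ZeroDivisionError is captured
```

In standard Python, KeyboardInterrupt and SystemExit derive from BaseException, so a bare `except Exception` already lets them through; listing them in the first clause makes the intent explicit. MemoryError and RecursionError are Exception subclasses and do need the explicit re-raise.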

Key follow-on decisions (driven by Round 6 — locked in plan):

  • Decision R6-A: seed=None deterministic-by-default; fix docs only.

  • Decision R6-B: Reject duplicate MetricSpec.name with ValueError.

  • Decision R6-C: Add n_resamples + method to to_pandas() schema.

  • Decision R6-D: Extend public-API snapshot to cover Protocol method signatures.

  • Decision R6-E: v0.46.1 scope = R6-F1 + R6-F2 only; R6-A rolls to v0.47 (non-blocker per Decision Q’s “docstring” category).

  • Decision R6-F: Use n_bins=10 (pre-v0.46 default) in deprecation warnings + migration note about new v0.46+ factory default of n_bins=15. Corrects Gemini’s misverified pre-v0.46 default claim.

  • Decision R6-G: 3 ECE variants without metric_specs (debiased, l2, l2_debiased) route deprecation warnings to submodule path; do NOT add to metric_specs at v0.47.

  • Decision R6-H: make_spec_name() helper in metric_specs submodule only; not top-level.
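Decision R6-D asks the public-API snapshot to record Protocol method signatures, not only the class-level shape. A minimal sketch of that idea using inspect.signature alone; the Protocol below is a hypothetical stand-in, and the shipped test may additionally use typing.get_type_hints as the ledger states.

```python
import inspect
from typing import Protocol, Sequence


class MetricSpecLike(Protocol):  # hypothetical stand-in for a Tier-2 Protocol
    name: str

    def compute(self, y: Sequence[int], s: Sequence[float]) -> float: ...


def snapshot_protocol_methods(proto: type) -> dict[str, str]:
    """Map each public method defined on the Protocol to its signature string."""
    snapshot = {}
    for attr, member in vars(proto).items():
        if attr.startswith("_") or not inspect.isfunction(member):
            continue
        snapshot[attr] = str(inspect.signature(member))
    return snapshot


snap = snapshot_protocol_methods(MetricSpecLike)
```

Comparing such a mapping against a stored copy makes an added, removed, or renamed parameter on compute show up as a diff, which a class-level (*args, **kwargs) snapshot cannot see.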

Round 6 v0.46.1 ship status (2026-05-21)

  • R6-F1 ✅ SHIPPED in v0.46.1 (PR #67, squash 7a4bb14). End-to-end verified in consumer: ms.ece(strategy="typo") raises ValueError; direct _EceSpec(strategy=...) construction also raises (defence-in-depth).

  • R6-F2 ✅ SHIPPED in v0.46.1 (PR #67, squash 7a4bb14). End-to-end verified in consumer: eval_toolkit.expected_calibration_error warning carries ece(n_bins=10) + key ece_n_bins_10_strategy_uniform + migration note about v0.46+ n_bins=15 default. All 3 non-first-party variants (debiased, l2, l2_debiased) route to from eval_toolkit.metrics import submodule path.

  • 9 other Round 6 items (R6-A docstring, R6-B duplicate-name guard, R6-C to_pandas schema, R6-D Protocol method-shape snapshot, R6-F4-Gemini make_spec_name, R6-F5-Gemini narrow except, R6-F6 plan/roadmap state-drift) → folded into release/v0.47.0 per Decision R6-E.

  • Round 6 STOP-GATE status: CLOSED. release/v0.47.0 can open after the v0.46.1 consumer cycle observation completes (1 cycle).
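The R6-F1 fix amounts to validating the strategy before any dispatch or key construction. A minimal sketch, reusing the error message quoted in the Round 6 table; the function names are hypothetical stand-ins for the ece() factory and are not the shipped code.

```python
_VALID_STRATEGIES = ("uniform", "quantile")


def validate_ece_strategy(strategy: str) -> str:
    """Fail fast on anything other than the two supported binning strategies."""
    if strategy not in _VALID_STRATEGIES:
        raise ValueError(
            f"ECE strategy must be 'uniform' or 'quantile'; got {strategy!r}"
        )
    return strategy


def ece_spec_name(n_bins: int = 15, strategy: str = "uniform") -> str:
    """Hypothetical factory stand-in: validate eagerly, then build the cell key."""
    validate_ece_strategy(strategy)
    return f"ece_n_bins_{n_bins}_strategy_{strategy}"


valid_key = ece_spec_name(strategy="quantile")
```

Validating in the factory means a typo never produces a scorecard key at all, so no cell with status "ok" can appear under an invalid name.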


Round 7 (complete: 2026-05-21 — Codex + Gemini; 3 substantive findings)

Reviewers: author (manual) + Codex (independent report) + Gemini (independent report).

Packet: v0.47.0 code state + the v1.0 plan + docs/source/methodology/ (16 chapters) + ADRs 0001/0002/0003 + docs/source/migration/v0.46.md + docs/source/migration/v0.47.md + Round 5/6 ledger.

Round-7 briefing: gate3-audit-round-7.md (committed a9e1114).

Reports: gate3-audit-round-7-codex-report.md + gate3-audit-round-7-gemini-report.md (untracked per .gitignore).

Headline: Codex 3 substantive findings; Gemini 0. Overlap was zero between the two reports — the most consequential finding (R7-F1 doc-migration boundary gap between Sybil-tested fences and MyST-NB-executed example notebooks) was Codex-only. Reinforces the Round 6 pattern (do not use overlap as a confidence floor; single-reviewer findings can be the most critical).

| ID | Reviewer | Severity (their words) | Finding | Disposition | Lands |
|---|---|---|---|---|---|
| R7-F1 | Codex | high before v0.48 | v0.47 doc migration missed MyST-NB executable example notebooks (separate from Sybil-collected .md fences). 6 example pages + 4 module-level docstrings + protocols.md autosummary + roadmap wording still reference removed APIs. Docs CI runs sphinx-build without -W, so notebook execution failures pass as advisory warnings. Verified via sphinx-build runtime probe — 6 execution failures buried in the warning stream. | Decision R7-A (locked at /exploring-options Q3): bundle into v0.48 §5G/§5H. §5G migrates the 6 notebooks + 4 docstrings + autosummary + roadmap; §5H enables nb_execution_raise_on_error = True in conf.py. Audit-as-seed expansion (Q2 locked full sweep) covered ALL module docstrings + drift in 5 existing api/*.md autosummary lists + 8 missing api/*.md pages. | RESOLVED v0.48.0 (§5G commit e07db16 + §5H commit 6349472 on release/v0.48.0) |
| R7-F2 | Codex | high before sweep freezes | sweep() records only strategy.name per row; two configured instances of same dataclass (e.g., DelimitVariant(delimiter="<<") + DelimitVariant(delimiter="[[")) silently merge under groupby("variant"). Style-coherent defect class with Round 6 R6-F3 (scorecard duplicate name) but with different semantics (row container vs. Mapping). | Decision R7-B option C (locked): emit strategy_id canonical column AND reject duplicate strategy_id at sweep boundary. Style invariants 1 (no silent failures) + 2 (natural call pattern is right) + 4 (canonical identifier + reject in canonical dimension) read together. | RESOLVED v0.48.0 (§5I commit f454afe) |
| R7-F3 | Codex | worth fixing before v1.0 | sweep() doesn’t validate scorer output cardinality. Three failure modes via runtime probe: overlong 1-D → silent truncation (worst); short 1-D → IndexError later; (n,2) matrix → TypeError when float() applied. | Decision R7-C (locked): API-level ValueError with contextual label at the sweep boundary; replaces all three low-level failure modes. Style invariants 1 + 3. | RESOLVED v0.48.0 (§5J commit fcf99f0) |
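Decision R7-C replaces three low-level failure modes with one boundary check. A minimal pure-Python sketch of such a check, assuming scorer output arrives as a sequence and using nested lists to stand in for an (n, 2) matrix; the function name and message wording are hypothetical.

```python
from numbers import Real
from typing import Sequence


def validate_scorer_output(scores: Sequence[object], n_expected: int, label: str) -> list[float]:
    """Reject wrong-length or non-scalar scorer output at the boundary."""
    if len(scores) != n_expected:
        raise ValueError(
            f"{label}: scorer returned {len(scores)} scores for {n_expected} inputs"
        )
    for i, value in enumerate(scores):
        if not isinstance(value, Real):
            raise ValueError(f"{label}: score at index {i} is not a scalar number")
    return [float(v) for v in scores]


clean = validate_scorer_output([0.1, 0.9], n_expected=2, label="variant=delimit")
```

The overlong and short cases fail on the length test, and a row-per-input matrix fails on the scalar test, so all three probes from R7-F3 surface as the same labelled ValueError. This sketch does not check finiteness, so NaN and inf would still pass.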

Gemini observations (Round 7)

Gemini’s report verdict was “highly stable; release/v0.48.0 is safe to open.” Six minor observations / validations; nothing critical that Codex hadn’t covered. The actionable items folded into v0.48:

  • §1-3 + 5-7: VALIDATIONS of v0.47 shipped state (TextTransform shape, shim removal, sweep design, R6-D Protocol method-shape snapshot, ADR 0003 tiers). No action needed.

  • §4 (pedagogical drift): Gemini noted “from eval_toolkit.metrics import pr_auc” is syntactically green but slightly undermines ADR 0002. v0.48 §5G migration explicitly chose scorecard() for example notebooks teaching METRIC USAGE; submodule path only where teaching the underlying math.

  • §4 (Makefile pre-push): Gemini recommended hardening to prevent the pytest tests/ path-override trap. Landed as v0.48 §5L (make pre-push target running all 3 doc-execution surfaces; commit 9878a54).

  • §5 (R6-C dtype coercion): Gemini noted n_resamples (int + NaN) → float64 is an accepted tradeoff. Landed as v0.48 §5K (Notes section on Scorecard.to_pandas() docstring; commit 6304cea).

  • §6 (SynonymSubstitution whitelist): Gemini recommended adding a docstring note about the hardcoded 6-entry whitelist. Landed as v0.48 §5K (adversarial.py Notes section; commit 6304cea).

Audit-as-seed extensions (v0.48)

Per user direction during plan refinement (“use the audits as seeds for things to reconsider”), the Round 7 findings + style-invariants framing surfaced additional v0.48 scope beyond Codex’s explicit list:

  • §5G expansion: from 4 Codex-flagged module docstrings to full sweep across src/eval_toolkit/ module docstrings + audit of all docs/source/api/*.md autosummary pages. Found 8 missing API pages + 5 drifted autosummary lists.

  • §5M new: in-source docstring drift audit (third doc-execution surface). Result: 82 PASS / 1 skipped / 0 fail; expanded .doctest-modules from 11 → 21 modules so CI catches future drift.

  • §5N comprehensive: cross-API shape-validation consistency sweep beyond Codex’s R7-F3 target. Audited metrics_at_threshold, paired_bootstrap_op_point_diff, bootstrap_metric_from_predictions, metrics.py scalars, fit_*_binary / fit_*_calibrator. Tightening commit landed for metrics_at_threshold silent threshold semantics (commit 76773dc); paired_bootstrap_op_point_diff val_y is test_y guard landed as part of §5E-prep code-side fix (commit 5c8e68d).

Round 7 ship status

  • 3 substantive Codex findings: all RESOLVED in v0.48.0 via §5G + §5H + §5I + §5J.

  • 6 Gemini observations: all RESOLVED in v0.48.0 via §5G + §5K + §5L.

  • Audit-as-seed extensions (§5G expanded, §5M new, §5N comprehensive): all RESOLVED in v0.48.0.

  • Round 7 STOP-GATE status: CLOSED via v0.48.0 release. Round 8 audit STOP-GATE per Decision Y.2 opens against the v0.48.0 state before v1.0.0 tag can land.


Round 8 (2026-05-24) — STOP-GATE before v1.0 tag, rectified in v0.51.0

Multi-LLM cross-review fired against the v0.50.0 state. Codex returned 10 findings (3 with reproducible probes: C3, C4(a)/(b), C9); Gemini returned 5 findings + 3 positive validations (V1 style, V2 methodology, V3 scorer_error safety). Independent verification by Claude (audit-verification-codex-gemini-v0.50.0.md, 510 lines, untracked per .gitignore) confirmed 13 of 18 items, refuted 3 (R8-G2, R8-G5, and the over-confident Gemini validations R8-V1 + R8-V2, counted as one), and deferred 2 (R8-G3, R8-G4) to v1.x as Tier-2 additive. The v0.51.0 release ships fixes for all 13 confirmed items; the 3 refuted entries are recorded below for the audit trail.

Round 9 multi-LLM cross-review runs against the v0.51 RC before v1.0 tags.

Confirmed (13) — fixed in v0.51.0

| ID | Severity | Finding | Disposition | Commit |
|---|---|---|---|---|
| R8-C3 | blocker (escalated from Codex fix-rec) | recall_at_fpr fallback violated FPR ceiling (actual_fpr=1.0 when target_fpr=0.0); probe y=[0,1], scores=[1.0,1.0]. | v0.51 sentinel threshold=np.inf, actual_fpr=0.0, fp=0. | 672d45f |
| R8-C4(a) | blocker (escalated) | Generator-rng not parallel-stable in _score_all_slices; n_jobs=1 vs n_jobs=2 produced different CIs; SPEC 7 contract violation. | v0.51 spawns child SeedSequences at the dispatch boundary. | 61964f6 |
| R8-C4(b) | blocker (escalated) | spawn_seed_sequences ignored Generator state (extracted bit_generator.seed_seq). | v0.51 draws fresh entropy via rng.integers(...). | 87453f6 |
| R8-C1 | fix-recommended | evaluate_folded(seeds=) reused the splitter across the seed loop — partitions never varied. | v0.51 adds reseed_splitter callback + DeprecationWarning when multi-seed + None. Warning persists past v1.0. | f60d43c |
| R8-C2 | fix-recommended | SourceDisjointKFoldSplitter.iter_folds yielded empty test partitions when k > n_sources; get_n_splits returned min(k, n_sources). | v0.51 caps iter_folds at min(k, n_sources) + UserWarning. | f60d43c |
| R8-C5 | fix-recommended | README links to docs/... (broken); migration toctree listed only v0.7–v0.9 despite v0.49–v0.51 BREAKING releases. | v0.51 repoints all README links to docs/source/...; toctree + MIGRATION.md index extended through v0.51; v0.49/v0.50/v0.51 migration guides authored. | c206b54 |
| R8-C6 | fix-recommended | calibration.reliability_curve, maximum_calibration_error did not validate that y_score lies in [0,1]; fit_temperature did not validate bounds. | v0.51 Tier-2 additive validation matching metrics.py-side ECE rigor. | 4c5e140 |
| R8-C7 | fix-recommended | Tests covered counts where semantics matter: test_harness_folded.py:92-107 no partition-content check; test_harness_parallelism.py:77-122 no Generator-rng; test_protocol_conformance.py:228-240 no k>n_sources. | v0.51 adds semantic-property regression tests bundled with C1/C2/C4 fixes. | 87453f6 + 61964f6 + f60d43c |
| R8-C8 | minor | SimilarityStrategy listed as Tier-2 in README + extending.md but absent from strict_tier2_protocols.md’s 9-strict list. | v0.51 demotes SimilarityStrategy to “pre-v0.7 internal interface” in README + extending.md (aligns to v1.0 contract per ADR 0003). | c206b54 |
| R8-C9 | minor | GateResult.to_dict() returned numpy + NaN unchanged; json.dumps(..., allow_nan=False) raised TypeError on the result. | v0.51 docstring documents the JSON-safety contract — strict-JSON requires artifacts.write_json_strict or sanitize_for_json. | c206b54 |
| R8-C10 | minor | .gitignore patterns covered only gate3-audit-*-report.md; new codex-comprehensive-audit-*-report.md + audit-verification-*.md were tracked by default. | v0.51 .gitignore extended with new patterns. | c206b54 |
| R8-F1 | fix-recommended (Claude — missed-by-both) | losses.RecallAtLowFPR.__init__ did not validate pos_weight > 0; non-positive values produced degenerate-but-bounded loss values silently. | v0.51 eager validation matching sibling-kwarg pattern. | 4c5e140 |
| R8-F2 | minor (Claude — missed-by-both) | metric_specs.ece(n_bins=) validated strategy eagerly but deferred n_bins to compute time. | v0.51 eager _validate_n_bins call at factory level. | 4c5e140 |
| R8-F3 | minor (Claude — missed-by-both) | analysis.CsvPredictionReader.read_predictions silently filled missing CSV columns with empty strings → cryptic dtype error downstream. | v0.51 detects missing columns at read time; raises actionable ValueError. | 4c5e140 |
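The R8-C3 probe can be reproduced with a small pure-Python sketch of the sentinel behaviour. It assumes a prediction is positive when score >= threshold and uses math.inf in place of np.inf; the function name and return shape are hypothetical, not the library's recall_at_fpr.

```python
import math


def recall_at_fpr_sketch(y, scores, target_fpr):
    """Best recall among thresholds whose FPR stays at or below target_fpr."""
    negatives = sum(1 for label in y if label == 0)
    positives = sum(1 for label in y if label == 1)
    best = None
    for threshold in sorted(set(scores), reverse=True):
        fp = sum(1 for label, s in zip(y, scores) if label == 0 and s >= threshold)
        tp = sum(1 for label, s in zip(y, scores) if label == 1 and s >= threshold)
        fpr = fp / negatives if negatives else 0.0
        if fpr <= target_fpr:
            recall = tp / positives if positives else 0.0
            if best is None or recall > best["recall"]:
                best = {"threshold": threshold, "recall": recall, "actual_fpr": fpr, "fp": fp}
    if best is None:
        # No finite threshold meets the ceiling: predict nothing positive.
        best = {"threshold": math.inf, "recall": 0.0, "actual_fpr": 0.0, "fp": 0}
    return best


probe = recall_at_fpr_sketch(y=[0, 1], scores=[1.0, 1.0], target_fpr=0.0)
```

On the probe, the only candidate threshold admits the negative row, so the sketch falls back to the infinite sentinel with zero false positives instead of reporting an FPR of 1.0. Callers who need a finite threshold can test the result with math.isinf.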

Refuted (3) — recorded for audit trail; NO fix shipped

| ID | Severity (as auditor marked it) | Finding (rejected) | Rationale |
|---|---|---|---|
| R8-G2 | fix-recommended (Gemini) | “Cyclic import leakage→harness identified but left unresolved.” | REFUTED: harness.py:709-712 resolves the cycle via TYPE_CHECKING + lazy noqa: PLC0415 import; comment explicitly states the resolution. Cycle is deliberately broken by an asymmetric-by-design pattern, not unresolved. |
| R8-G5 | minor (Gemini) | “Plotting tests use weak assertions (assert fig is not None).” | REFUTED: tests/test_plotting_edge.py:125-140 does have that weak assertion, but the same file has structural assertions (line 92 assert out is fig; lines 220, 230 .get_title() equality). Cherry-picked; not representative. |
| R8-V1 | “Style + consistency: Exceptional” (Gemini) | Validation that the axis is exceptional. | REFUTED: Gemini’s specific tooling claim (Ruff + Black + Mypy strict compliance) is true, but the axis is broader. Codex C1 + C5 + C6 + C7 + C8 + R8-G3 confirm substantive cross-API / docs / contract consistency drifts. Tooling-clean ≠ axis-exceptional. |
| R8-V2 | “Domain methodology: Masterclass” (Gemini) | Validation that methodology is masterclass. | REFUTED: Codex C1 + C3 + C4 + C6 confirm methodology bugs at fix-recommended-or-blocker severity. Methodology has real flaws; “Masterclass” overstates. Gemini’s evidence (research-grounded tests exist) conflates research-citation discipline with methodology correctness. |
| R8-V3 | “scorer_error catches predictions safely” (Gemini) | Validation that the mechanism is safe. | CONFIRMED, recorded for completeness: harness.py:468-474 correctly re-raises MemoryError + AssertionError BEFORE the broad except Exception, so critical exceptions escape the silent-capture path. |

Deferred to v1.x (2)

| ID | Severity | Finding | Disposition |
|---|---|---|---|
| R8-G3 | fix-recommended (Gemini) | Codebase uses ValueError for every error condition (295 raises in src/; 0 custom exception classes). | Deferred to v1.x as Tier-2 additive. Custom exception hierarchy subclassing ValueError preserves catch-compatibility; adding it post-v1.0 is non-breaking. Reconsider when a downstream consumer requests it. |
| R8-G4 | minor (Gemini) | joblib OOM hazard documented at _parallel.py:55-59 but not structurally mitigated (no memory-aware n_jobs capping in harness.evaluate). | Deferred to v1.x. Memory-aware capping needs RAM-measurement + dataframe-size accounting — non-trivial, no clear best-practice. Caller is responsible for n_jobs sizing under their RAM budget at v1.0. |
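The catch-compatibility argument in the R8-G3 deferral can be shown in a few lines. Both exception classes below are hypothetical; per the finding, no custom exception classes exist in the codebase.

```python
class EvalToolkitError(ValueError):
    """Hypothetical base class for a future exception hierarchy."""


class ScorerOutputError(EvalToolkitError):
    """Hypothetical subclass for one error family."""


def legacy_caller() -> str:
    """A caller written against today's behaviour, catching ValueError only."""
    try:
        raise ScorerOutputError("scorer returned 3 scores for 2 inputs")
    except ValueError as exc:  # pre-existing handlers keep working
        return f"caught: {exc}"


result = legacy_caller()
```

Because every new class still is a ValueError, handlers written before the hierarchy exists continue to catch the same failures, while newer code can opt into the narrower types.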

Round 8 ship status

  • 13 confirmed findings: all RESOLVED in v0.51.0 via the per-finding commits above (Phase 1 + 2 + 3 + 4 on release/v0.51.0).

  • 3 refuted findings (R8-G2 cyclic-import framing, R8-G5 cherry-picked weak test, R8-V1+R8-V2 over-confident Gemini validations paired-as-one): recorded above; no shipped fix needed.

  • 2 deferred findings (R8-G3 custom exceptions, R8-G4 joblib OOM capping): Tier-2 additive; v1.x or later.

  • Tally: 13 + 3 + 2 = 18 total (canonical; aligned with CHANGELOG.md v0.51 section + migration/v0.51.md per RC4 reconciliation).

  • Round 8 STOP-GATE status: CLOSED via v0.51.0 ship. Round 9 audit STOP-GATE per Decision Y.2 opens against the v0.51.0 RC before v1.0.0 tag can land.
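The strict-JSON contract documented for R8-C9 can be illustrated with the standard library alone. This sketch covers plain Python floats only; it does not handle numpy scalars, and sanitize_for_json_sketch is a hypothetical stand-in rather than the library's sanitize_for_json.

```python
import json
import math


def sanitize_for_json_sketch(obj):
    """Replace non-finite floats with None, recursing into dicts and lists."""
    if isinstance(obj, float) and not math.isfinite(obj):
        return None
    if isinstance(obj, dict):
        return {str(k): sanitize_for_json_sketch(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [sanitize_for_json_sketch(v) for v in obj]
    return obj


payload = {"gate": "recall_floor", "observed": float("nan"), "passed": False}

try:
    json.dumps(payload, allow_nan=False)
    strict_failed = False
except ValueError:
    # With plain Python floats the stdlib rejects NaN under allow_nan=False.
    strict_failed = True

encoded = json.dumps(sanitize_for_json_sketch(payload), allow_nan=False)
```

The payload keys are made up for the example. Mapping NaN to null is one possible policy; a caller who must distinguish "undefined" from "missing" would need a different encoding.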


Round 9 (2026-05-24) — STOP-GATE before v1.0 tag

Round 9 multi-LLM cross-review fired against the release/v0.51.0 RC (7 commits at the time of audit; PR #75 draft, CI green). Codex returned 4 substantive findings (RC1-RC4) + self-validation worklog (69/69 v0.51 regression tests pass); explicitly skipped the 10 modules neither Round-8 auditor cited + 4 cross-cutting hunts. Gemini returned 6 items (RG1-RG6): 3 design challenges (2 escalating already-locked v0.51 decisions to v1.0 blockers) + 3 positive validations; zero probes, zero line citations, ~45% validation density. Independent verification by Claude (audit-verification-round-9-v0.51.0.md, 228 lines; gitignored per R8-C10) confirmed 6 / refuted 3 / partial 1 of the 10 source-report items, AND surfaced 3 third-audit findings + 3 minors by hunting the modules Codex skipped. Of those, F-sweep-1 was a CANDIDATE v1.0 BLOCKER (R7-C “no silent failures” invariant violation on NaN scorer output) that neither Round-8 nor Round-9 auditor caught.

Per the Path 3 lock from the post-R9 planning session, two third-audit findings (F-sweep-1 + F-bootstrap-1) ship fixed in this PR; RC1 already fix-in-PR at 4c43771; the remaining items defer to v1.0.1.

Confirmed (6) — disposition mapped

| ID | Severity | Finding | Disposition | Commit / Defer-to |
|---|---|---|---|---|
| R9-RC1 | fix-recommended (Codex) | README.md:208 has 2 broken docs/extending.md + docs/examples/claims_and_gates.md hyperlinks — R8-C5 regression (only the API-surface table-cell links; other R8-C5 sites are correct). | Fixed in PR. | 4c43771 |
| R9-RC2 | fix-recommended (Codex) | SimilarityStrategy contract mismatch: __init__.py:294 exports it + snapshot.json:1103-1117 pins protocol_methods (de facto strict Tier-2 contract) BUT extending.md:18 + strict_tier2_protocols.md 9-strict list omit it. Snapshot IS the v1.0 contract regardless of docs. | Deferred to v1.0.1. Two paths: add to strict-Tier-2 list (10 strict + 1 opt-in; ADR 0003 update) OR remove from __all__ + snapshot. | v1.0.1 |
| R9-RC3 | fix-recommended (Codex) | test_evaluate_folded_reseed_splitter_varies_partitions at tests/test_harness_folded.py:134-164 only asserts key existence + count; no row-content comparison across seeds. Docstring claims partitions differ but assertions don’t check it. Directly REFUTES Gemini RG6. | Deferred to v1.0.1 test hardening. | v1.0.1 |
| R9-RC4 | minor (Codex) | “3 refuted” counts categories (V1+V2 paired as one) while 4 named items listed; defensible across all 4 surfaces but invites confusion. | Deferred to v1.0.1 docs polish. | v1.0.1 |
| R9-RG4 | validation (Gemini) | metrics.py SINGLE_CLASS_INCOMPATIBLE_METRICS pattern is “elegant”. | Validation accurate; pattern at metrics.py:35-99 is well-designed. | |
| R9-RG5 | validation (Gemini) | bootstrap.py defaults match “industry standards”. | Validation accurate: bootstrap.py:66-69 shows N=1000, α=0.95, BCa, seed=42 — all match common conventions. Terminology nit: Gemini used outdated “seed=” instead of v0.50 “rng=”. | |
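R9-RC3 contrasts a count-only assertion with a content comparison across seeds. The difference can be sketched with a toy seeded splitter; make_folds and assert_partitions_vary are hypothetical helpers, not code from tests/test_harness_folded.py.

```python
import random


def make_folds(n_rows: int, k: int, seed: int) -> list[list[int]]:
    """Toy seeded k-fold partition; a stand-in for a reseeded splitter."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [sorted(indices[i::k]) for i in range(k)]


def assert_partitions_vary(folds_by_seed: dict[int, list[list[int]]]) -> None:
    # Weak proxy (the pattern RC3 criticises): every seed yields the same fold count.
    assert len({len(folds) for folds in folds_by_seed.values()}) == 1
    # Content check: at least two seeds must yield different row assignments.
    distinct = {tuple(map(tuple, folds)) for folds in folds_by_seed.values()}
    assert len(distinct) > 1, "all seeds produced identical partitions"


folds_by_seed = {seed: make_folds(n_rows=20, k=4, seed=seed) for seed in (0, 1, 2)}
assert_partitions_vary(folds_by_seed)
```

A splitter that ignores the seed would still pass the fold-count check but fail the content check, which is the gap the finding describes.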

Refuted (3) — recorded for audit trail; NO fix shipped

| ID | Severity (as auditor marked it) | Finding (rejected) | Rationale |
| --- | --- | --- | --- |
| R9-RG2 | v1.0 blocker (Gemini) | “DeprecationWarning without sunset date is structurally wrong; v1.0 blocker.” | REFUTED: re-litigates Q3-locked v0.51 design. harness.py:1413-1414 shows the deliberate implementation; lines 1347-1349 + 1368 document “warning persists past v1.0” as chosen contract. Tested by test_evaluate_folded_multi_seed_without_reseed_emits_deprecation_warning. Meta-pattern: Gemini escalates without engaging with locked reasoning (same as R8 V1+V2). |
| R9-RG3 | v1.0 blocker (Gemini) | “iter_folds capping silently is wrong; should raise.” | REFUTED: re-litigates R8-C2 design lock. splits.py:325-339 shows the deliberate cap-with-UserWarning implementation; warn-vs-raise IS a judgment call; the v0.51 decision was permissive-by-design (caller can opt into warnings.filterwarnings('error', category=UserWarning) for strictness). |
| R9-RG6 | validation (Gemini) | “Regression tests apply strict invariant assertions, not weak proxies.” | REFUTED: directly contradicted by RC3, which I independently confirmed. tests/test_harness_folded.py:134-164 is precisely the weak-proxy pattern Gemini claims doesn’t exist. Gemini didn’t read the cited tests; textbook over-confident validation. |
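The opt-in strictness mentioned in the R9-RG3 rationale can be sketched with the standard-library warnings module. The function capped_folds below is a hypothetical stand-in for a cap-with-UserWarning behaviour; it is not the real iter_folds, and its name and signature are assumptions made for illustration only.

```python
import warnings


def capped_folds(requested: int, available: int) -> int:
    """Hypothetical stand-in: cap the fold count and warn instead of raising."""
    if requested > available:
        warnings.warn(
            f"requested {requested} folds, capping to {available}",
            UserWarning,
            stacklevel=2,
        )
        return available
    return requested


# Permissive default: the cap applies and a UserWarning is recorded.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    permissive = capped_folds(10, 4)
warned = any(issubclass(w.category, UserWarning) for w in caught)

# Strict opt-in: the caller escalates UserWarning to an exception.
# Note that category must be passed by keyword; the second positional
# parameter of warnings.filterwarnings is the message regex.
strict_raised = False
with warnings.catch_warnings():
    warnings.filterwarnings("error", category=UserWarning)
    try:
        capped_folds(10, 4)
    except UserWarning:
        strict_raised = True
```

Wrapping the filter in warnings.catch_warnings() keeps the escalation local to the block, so the strictness does not leak into unrelated code.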

Partial (1)#

| ID | Severity | Finding | Disposition |
| --- | --- | --- | --- |
| R9-RG1 | accept-with-rationale (Gemini) | np.inf sentinel from recall_at_fpr risks silent downstream failure. | PARTIAL: internally NO callers (grep across src/ returned only function def + docstring mentions + an unrelated local var in losses.py:156). Sentinel risk is real only at user-code boundary; mitigated by Q1-locked design + docstring’s “caller may filter via np.isinf” guidance. Accept-with-rationale at v1.0. |

Third-audit findings (Claude — modules neither auditor cited) — 3 substantive + 3 minor#

| ID | Severity | Finding | Disposition | Commit / Defer-to |
| --- | --- | --- | --- | --- |
| R9-F-sweep-1 | blocker for v1.0 (Claude escalation) | _sweep.py:_validate_scorer_output() accepts NaN/inf scorer outputs without validation; R7-C “no silent failures” invariant violation in a module Codex R7 designed but didn’t audit for finiteness. stacking.py validates non-finite scores; _sweep.py didn’t. | Fixed in PR. NaN/inf check added to R7-C boundary; test bundled. | New commit on release/v0.51.0 (this PR) |
| R9-F-bootstrap-1 | fix-recommended | scipy’s BCa returns degenerate CIs (ci_low == ci_high == point or NaN bounds) on small n + ceiling/floor metrics; scipy emits DegenerateDataWarning but doesn’t raise; pre-v0.51 R8-C4(b) RNG bug may have masked this. | Fixed in PR. UserWarning added at the bootstrap_ci boundary when BCa degenerates; test bundled. | New commit on release/v0.51.0 (this PR) |
| R9-F-bootstrap-2 | minor | mde_from_ci width check if width <= 0 doesn’t catch NaN width (NaN <= 0 is False in IEEE float); silent NaN MDE if BCa returns NaN bounds. | Bundled into the F-bootstrap-1 commit. Guard added: or not np.isfinite(width). | New commit on release/v0.51.0 (this PR) |
| R9-F-metrics-1 | minor (Claude) | metrics.py:1305 Brier-decomposition docstring claims “≈ 1e-9” precision but actual error ~3e-3 dominated by binning. Self-contradictory. | Deferred to v1.0.1 docs polish. | v1.0.1 |
| R9-F-metrics-3 | fix-recommended (Claude) | expected_calibration_error(y=[0,0,1,1], y_score=[0.5,0.5,0.5,0.5], n_bins=10) returns 0.0; technically correct per formula but semantically misleading (uninformative model looks perfectly calibrated). Equal-mass binning correctly rejects n<n_bins; equal-width has no guard. | Deferred to v1.0.1; either guard against all-same-score input OR explicitly document. | v1.0.1 |
| R9-F-metrics-4 | minor (Claude) | brier_score single-class docstring ambiguous about whether single-class is supported; implementation correctly handles it but wording is “implementation detail” not contract statement. | Deferred to v1.0.1 docs polish. | v1.0.1 |
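To make R9-F-metrics-3 concrete, here is a minimal equal-width ECE sketch. It is a from-scratch illustration under the usual definition (bin-weighted absolute gap between mean label and mean score), not the library's expected_calibration_error; the function name ece_equal_width and the guard condition at the end are assumptions.

```python
import numpy as np


def ece_equal_width(y_true, y_score, n_bins=10):
    """Equal-width ECE sketch: sum over bins of (n_b / N) * |mean(y) - mean(score)|."""
    y = np.asarray(y_true, dtype=float)
    s = np.asarray(y_score, dtype=float)
    # Map each score to a bin index in [0, n_bins - 1]; a score of 1.0 lands in the last bin.
    idx = np.minimum((s * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            ece += mask.mean() * abs(y[mask].mean() - s[mask].mean())
    return float(ece)


# The uninformative constant-0.5 scorer from the finding: one occupied bin,
# mean label 0.5, mean score 0.5, so the gap is exactly zero.
constant = ece_equal_width([0, 0, 1, 1], [0.5, 0.5, 0.5, 0.5])

# A constant scorer at a different level is penalised, so the zero above is
# a coincidence of prevalence, not evidence of a useful model.
shifted = ece_equal_width([0, 0, 1, 1], [0.9, 0.9, 0.9, 0.9])

# One possible guard condition (hypothetical): detect all-same-score input.
all_same_score = np.unique([0.5, 0.5, 0.5, 0.5]).size == 1
```

Whether to raise, warn, or only document this case is the open choice recorded in the disposition; the sketch only shows why the formula yields 0.0.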

Round 9 ship status#

  • 6 confirmed source-report findings: 1 fixed in PR (RC1 → 4c43771), 3 deferred to v1.0.1 (RC2, RC3, RC4), 2 validation-confirmations (RG4, RG5; no action).

  • 3 refuted source-report findings (RG2, RG3, RG6): recorded above for audit trail; no shipped fix.

  • 1 partial (RG1): accept-with-rationale at v1.0.

  • 2 third-audit fixes shipped in this PR: F-sweep-1 (CANDIDATE BLOCKER closed); F-bootstrap-1 + F-bootstrap-2 bundled (degeneracy warning + NaN guard).

  • 4 third-audit items deferred to v1.0.1: F-metrics-1, F-metrics-3, F-metrics-4, RC2 (SimilarityStrategy contract resolution).

  • Round 9 STOP-GATE status: CLOSED with mitigations per the Path 3 lock. v1.0 can tag from this PR post-CI-green.

Multi-LLM audit-machinery calibration notes#

  • Codex R9 quality: 4/4 findings hold; RC1 over-cited (3 sites named but only 1 had real broken links); no padding finding this round. Gap remains: explicit skip of the 10 untouched modules + 4 cross-cutting hunts. F-sweep-1 (the candidate blocker) lives in precisely that gap. Future-round briefings should make the untouched-modules + hunts mandatory.

  • Gemini R9 quality: 2/6 validations honest (RG4, RG5); 3/6 REFUTED (RG2, RG3, RG6). RG6 is the new V1/V2 pattern — uncited positive validation on tests Gemini didn’t read; directly contradicted by Codex’s RC3 + my verification. RG2 + RG3 escalate locked design without engaging with the recorded rationale.

  • Trust calibration: Codex’s probe-backed findings → high trust; Codex’s untouched-module gaps → fill with Claude third-audit. Gemini’s validations → unreliable; Gemini’s design dissent → “raises the question” not “v1.0 blocker.”


Round 10 (2026-05-25) — v1.0 pre-tag micro-audit#

Scoped Codex + Gemini micro-audit on the edadddc R9 follow-on commit only (full ~/Claude/audit-templates/audit-prompt.md template with a scope-override preamble constraining attention to ~126 LOC src + ~110 LOC tests in _sweep.py + bootstrap.py). Dispatched per the Round 10 locking decision: targeted micro-audit, not full multi-LLM gate re-run. Codex returned a 311-line report with 3 substantive findings + per-axis verdicts + 4-test probe verification (all 4 passed). Gemini returned an 86-line report with 2 findings (1 self-recommended-status-quo + 1 fix-recommended on metrics.py finiteness — direct contradiction with Codex’s “Completeness” verdict). Claude verification re-read every cited line. Reports archived locally at codex-microaudit-edadddc-report.md + gemini-microaudit-edadddc-report.md (gitignored).

Confirmed (3) — disposition mapped#

| ID | Severity | Source | Finding | Disposition | Commit |
| --- | --- | --- | --- | --- | --- |
| R10-F1 | fix-recommended | Codex | _sweep.py:298-299 error message says “finite floats in [0, 1]” but check is np.isfinite only; no range enforcement. Cross-ref protocols.py:29-51 Scorer Protocol docstring also lacked explicit [0, 1] contract statement. | Fixed in PR (Codex Option C): extend Protocol docstring to document [0, 1] calibrated-probability semantics + reword sweep runtime message to drop [0, 1] (boundary still doesn’t enforce range; enforcement deferred to a future minor once consumer usage patterns clarify). | R10 follow-on commit on release/v0.51.0 |
| R10-F2 | fix-recommended | Codex | tests/test_bootstrap_unit.py:337-343 BCa degeneracy test uses if ci.ci_low == ci.ci_high == ci.point_estimate:, which silently no-ops when scipy returns NaN bounds (NaN==NaN is False in IEEE float). Test passes WITHOUT proving the warning fires. Codex probe-verified the no-op path on the current scipy fixture. | Fixed in PR (Codex Option A): mirror the production predicate: (not np.isfinite(low)) or (not np.isfinite(high)) or (low == high == point). Assertion block runs whenever ANY degeneracy mode fires. | R10 follow-on commit on release/v0.51.0 |
| R10-F3 | minor-observation | Codex | bootstrap.py:1099-1105 Raises docstring says “non-positive width” but :1132-1140 implementation rejects on width <= 0 or not np.isfinite(width). Code stricter than docs. | Fixed in PR (Codex Option A + brief Option B context): update Raises text to “non-positive or non-finite width” + 4-line note explaining scipy BCa NaN-bound motivation. | R10 follow-on commit on release/v0.51.0 |
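The NaN pitfalls behind R10-F2 and R10-F3 can be reproduced in a few lines. The two predicates below copy the expressions quoted in the table; the function names and the sample tuples are illustrative assumptions, not code from bootstrap.py or the test suite.

```python
import numpy as np


def old_test_guard(low: float, high: float, point: float) -> bool:
    """Pre-fix test condition: true only when the bounds collapse exactly."""
    return bool(low == high == point)


def is_degenerate(low: float, high: float, point: float) -> bool:
    """Predicate quoted in R10-F2: non-finite bounds OR exactly collapsed bounds."""
    return bool(
        (not np.isfinite(low)) or (not np.isfinite(high)) or (low == high == point)
    )


nan = float("nan")
collapsed = (1.0, 1.0, 1.0)   # ceiling metric on small n
nan_bounds = (nan, nan, 1.0)  # BCa returned NaN bounds
healthy = (0.2, 0.8, 0.5)

# Width guard from R10-F3: NaN compares False against everything,
# so `width <= 0` alone lets a NaN width through.
width = nan_bounds[1] - nan_bounds[0]
old_width_rejects = bool(width <= 0)
new_width_rejects = bool(width <= 0 or not np.isfinite(width))
```

With NaN bounds the old guard evaluates to False, so an assertion block placed under it never runs; the mirrored predicate is True for both degeneracy modes and False for a healthy interval.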

Partial (1) — accept-as-design#

| ID | Severity | Source | Finding | Disposition |
| --- | --- | --- | --- | --- |
| R10-RG1 | minor-observation | Gemini | bootstrap.py:376-386 BCa degeneracy check uses output proxy (ci_low == ci_high == point + non-finite bounds) rather than catching scipy’s internal DegenerateDataWarning. May miss edge-case degeneracies where jackknife acceleration is undefined but bounds don’t exactly collapse. | Accept-as-design. Gemini’s own recommendation was Option B (status quo): “Deterministic, fast, and covers the primary ‘small n + ceiling metric’ failure mode perfectly.” Output-proxy approach avoids warnings.catch_warnings overhead on a hot path. Catching internal DegenerateDataWarning is a potential v1.1 enhancement if real-world degeneracy modes surface that the output proxy misses. |

Refuted (1) — Pattern-1 calibration record#

| ID | Severity (as Gemini marked it) | Source | Finding | Rationale |
| --- | --- | --- | --- | --- |
| R10-RG2 | fix-recommended | Gemini | “metrics.py functions like pr_auc and roc_auc lack a similar np.isfinite boundary check.” | REFUTED. metrics.py:_validate_inputs (line 1846) explicitly contains if not np.isfinite(y_score_arr).all(): raise ValueError("y_score contains NaN or inf"). Called at 20+ public-function sites (lines 366, 441, 479, 552, 650, 802, 872, 942, 1036, 1157, 1261, 1337, 1429, plus more). The check predates the audit chain (the comment “harmonizes with score_distribution_summary’s own guard” suggests pre-v0.46 vintage). Gemini did not read the cited code before making the claim; this is a textbook Pattern-1 (validation-without-reading) violation, despite the micro-audit prompt’s Calibrated-confidence discipline section explicitly calling out this exact pattern. The calibration record is preserved here. |

Multi-LLM audit-machinery calibration (R10)#

  • Codex R10 quality: 3/3 substantive findings hold; probe-backed; cited file:line on every observation; honest “What I didn’t look at” calibration. Same depth and rigor as R8 + R9.

  • Gemini R10 quality: 1/2 findings honest (RG1, with self-recommendation = status quo; effectively an open question framed as a finding); 1/2 REFUTED (RG2, Pattern-1 violation). The prompt’s explicit Pattern-1 discipline section did NOT change the outcome. Pattern-1 is now confirmed in four instances across three consecutive rounds: R8 V1+V2, R9 RG6, and R10 RG2. Conclusion: Gemini’s training-trace defaults toward positive validation without reading; explicit prompt-level discipline is necessary but not sufficient to correct it. Practical implication: future audits should treat Gemini validations as “raises the question” and require Claude (or Codex) read-back before accepting any positive validation.

  • Round 10 STOP-GATE status: CLOSED with mitigations — 3 Codex confirmed findings fixed in this RC; 1 Gemini accept-as-design; 1 Gemini refuted. v1.0 can tag from the R10 follow-on commit post-CI-green.


Round 11 (2026-05-26) — consumer-feedback-driven validator adoption#

Not a multi-LLM gate-style audit. Round 11 is the first consumer-feedback-driven round: consumer-side audit findings → upstream validator shipped → consumer adopts. Multi-LLM cross-review (Rounds 5–10) was the v1.0 stability-contract activation discipline; Round 11 is the v1.x ongoing consumer-feedback loop. Future Round 12+ would resume multi-LLM dispatch only if v2.0 work or a major audit cycle re-opens. This is flagged so readers don’t confuse R11 with the heavyweight gate-style rounds.

Motivating findings (consumer audit origin)#

  • V1.3.1 ADR-080 audit-fix (2026-05-22, consumer prompt-injection-detection-prototype): WRITEUP_NARRATIVE.md:38 said “TF-IDF + LR reaches 0.974 AUPRC” but canonical TF-IDF direct val AUPRC = 0.971 (0.974 is LoRA’s value). The existing audit_numbers.py validates VALUES against source data but not BINDINGS — both 0.971 AND 0.974 exist in the expected set; the bug is the wrong pairing.

  • V1.3.2 P1-2 Part 2 (2026-05-23, consumer same repo): docs/REPRODUCIBILITY.md:76 cited “(locked at Phase 0-07 via ADR-029)”, but ADR-029 is the test-marker-strategy ADR, not the reproducibility tier-lock ADR (which is ADR-034). The citation is off by 5 in the ADR number; lychee resolves it (ADR-029 exists) and audit_numbers.py validates values, but neither catches semantic mis-citation.
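The difference between validating VALUES and validating BINDINGS in the first finding can be shown with a small sketch. The two numbers come from that finding; the dictionary layout and both function names are hypothetical and do not reflect the real audit_numbers.py or audit_value_bindings interfaces.

```python
# Canonical numbers from the finding above; the key layout is an assumption.
CANONICAL = {
    ("TF-IDF + LR", "AUPRC"): 0.971,
    ("LoRA", "AUPRC"): 0.974,
}

# The mis-paired sentence: "TF-IDF + LR reaches 0.974 AUPRC".
claim = ("TF-IDF + LR", "AUPRC", 0.974)


def value_check(value: float) -> bool:
    """Value-only audit: passes if the number exists anywhere in the expected set."""
    return value in set(CANONICAL.values())


def binding_check(detector: str, metric: str, value: float) -> bool:
    """Binding audit: passes only if the number belongs to the claimed (detector, metric)."""
    return CANONICAL.get((detector, metric)) == value


value_ok = value_check(claim[2])   # the number is a real result, just not this detector's
binding_ok = binding_check(*claim)
```

The value-only check accepts the sentence because 0.974 is a legitimate number in the source data; only the pairing check flags it.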

Upstream response timeline#

| Time (UTC) | Event |
| --- | --- |
| 2026-05-24 | Consumer files #71 (value_bindings), #72 (concept_drift), #73 (citation_alignment) at v1.3.3 |
| 2026-05-24 (later) | PR #74 opens with audit subpackage skeleton + validate_citations |
| 2026-05-25 17:01Z | v1.0.0 tag (stability contract activates per ADR 0003) |
| 2026-05-25 17:08Z | v1.0 snapshot regen hotfix (eb94d26) |
| 2026-05-25 22:51Z | PR #74 merged (audit_citation_alignment flat-module per ADR 0001; restructured from initial subpackage shape) |
| 2026-05-25 23:48Z | v1.0.1 ships: audit_citation_alignment + RC4 docs polish |
| 2026-05-26 01:02Z | v1.0.2 ships: #76 cleanup batch (RC2 SimilarityStrategy promoted to 10th strict Tier-2; RC3 test row-content hardening; F-metrics docstring polish) |
| 2026-05-26 01:48Z | Consumer PR #10 merged at v1.3.7; consumer-side audit_citation_alignment adoption complete |
| 2026-05-26 02:35Z | v1.0.3 ships: audit_value_bindings + closes #71 |

End-to-end cycle: V1.3.2 P1-2 origin (2026-05-23) → consumer adoption (2026-05-26 01:48Z) = ~3 days. Upstream-side v1.0.0 → v1.0.3 (audit-validator family of 2): ~10h compressed cycle. Consumer’s v1.3.7 PR opened against audit_citation_alignment (the v1.0.1 deliverable); subsequent v1.0.3 audit_value_bindings adoption will happen on consumer’s own cadence.

Audit-validator family status#

| Validator | Module | Released | Issue | Status |
| --- | --- | --- | --- | --- |
| Citation alignment | audit_citation_alignment | v1.0.1 (067d4b0) | #73 | ✅ shipped + consumer-adopted (consumer PR #10 v1.3.7) |
| Value bindings | audit_value_bindings | v1.0.3 (775f3da) | #71 | ✅ shipped; consumer adoption pending |
| Concept drift (embedding-clustering) | audit_sister_doc_concept_drift (proposed) | future | #72 | open; requires MiniLM (already vendored for leakage) + clustering design |

Pre-tag + post-v1.0.3 dogfood summary#

The validators were dogfooded against eval-toolkit’s own markdown surface to prove they work in production before consumer adoption:

  • v1.0.1 pre-tag dogfood (validate_citations): 95 files (README + docs/source/**/*.md ex-ADRs ex-Sphinx-generated); 0 misalignments. Helper _build_adr_subjects_from_repo() exercised.

  • Post-v1.0.3 dogfood (re-run + synthetic validate_reader_value_bindings):

    • Part 1 (citation_alignment expanded): 96 files; 0 misalignments (sanity preserved after audit_value_bindings.md + R10 + R11 additions to docs/source).

    • Part 2 (value_bindings synthetic): bindings derived from Round 8 RC4-reconciled tally (("Round 8", "confirmed_count") -> 13 etc). Validator processed audit_findings.md + migration/v0.51.md + CHANGELOG.md without crashing; surfaced 216 pattern-matches + 82 matches. The high false-positive count is expected — the synthetic bindings dict has very loose detector (“Round 8”) + metric (numeric-count) patterns, and any number near a “Round 8” mention triggers. NOT real bugs in eval-toolkit’s docs — just the validator successfully exercising the regex+window pipeline at production-scale markdown.

The dogfood script is preserved at .scratch/dogfood_v1_0_x.py (gitignored) for future re-runs.

Round 11 outcome#

  • ✅ Action 1 (consumer adoption): consumer PR #10 merged 2026-05-26 01:48Z. Consumer-side dogfood + adoption complete for validate_citations.

  • ✅ Action 2 (this ledger entry): R11 section recorded.

  • ✅ Action 3 (dogfood expansion): validators exercised on eval-toolkit’s own surface; no upstream-side findings.

GH issue #77 closed with state_reason=completed.

Multi-LLM audit cadence after R11#

The Gate 3 audit chain (R5–R10) closed at v1.0 per ADR 0003. R11+ is the post-v1.0 consumer-feedback loop and uses NO multi-LLM cross-review by default. Multi-LLM dispatch resumes only if:

  1. A v2.0 design cycle opens (would require a Gate 3-equivalent re-audit per ADR 0003 §“v2.0 re-audit triggers”).

  2. A consumer files a major-severity finding that single-LLM verification (Claude only) leaves uncertain, in which case Codex + Gemini dispatch via the audit-prompt.md template re-engages for that specific question.

Otherwise R11+ is light-touch documentation of the consumer-feedback → upstream-validator → consumer-adopt cycle.


Round 12 (2026-05-26) — schema-extensibility + scope-correctness lesson from #80#

Not a multi-LLM gate-style audit. Round 12 is the second consumer-feedback-driven round, immediately following R11. The R11 audit-validator family (3/3 shipped) revealed a deeper structural gap in audit_value_bindings that the v1.0.3 implementation could not address: the 2-tuple (detector, metric) canonical-identity schema collapsed across slices, producing ~95 false positives on the consumer’s writeup. The fix (v1.1.0) is bigger than the issue’s literal ask — it surfaces two architectural rules that now govern all audit validators (pending ADR 0005).

Motivating finding (consumer audit origin)#

  • Consumer’s v1.3.9 run (2026-05-26, prompt-injection-detection-submission): scripts/audit_value_bindings.py against own writeup produced 96 warnings, ~95 false positives. The one real bug (WRITEUP_PAPER.md:545 mis-citing 0.971 inside a misleading paragraph context) was correctly flagged but hid in the noise. Consumer caught it via parallel manual audit, not via this validator. Filed #80 at 2026-05-26T18:15:45Z proposing a 3-tuple (detector, metric, slice) schema extension.

Upstream response timeline#

| Time (UTC) | Event |
| --- | --- |
| 2026-05-26T18:15Z | Consumer files #80 (BINDINGS slice-axis schema) |
| 2026-05-26T18:30Z | /exploring-options Round 1: P0 path for v1.0.4 PyPI gap. Decision: refined bundle (rerun + workflow_dispatch + verify-receipt). |
| 2026-05-26T19:00Z | /exploring-options Round 2: #80 migration shape. Decision: Option 4, a BindingKey structured key + multi-shape adapter (rejected Option 1 (d, m, s) in-place tuple because positional tuples lock in the recur-every-N-months schema-event pattern). |
| 2026-05-26T19:24Z | v1.0.5 ships: infrastructure-only release (publish workflow hardening; dress rehearsal for the new Verify PyPI receipt step). |
| 2026-05-26T19:30Z | /exploring-options Round 3: Step 1 release strategy + verify specs. Decision: v1.0.5 as dress rehearsal; conservative 12×30s PyPI poll. |
| 2026-05-26T19:45Z | /exploring-options Round 4: ADR 0005 scope + deprecation policy. Decision: medium scope (audit validators only); indefinite acceptance of legacy 2-tuple BINDINGS through v1.x. |
| 2026-05-26T20:00Z | First dogfood of v1.1.0 slice-axis fix: only 22% noise reduction (95 → 74). User pushed back: “what is the right long-term solution?” Triggered the deeper rethink. |
| 2026-05-26T20:10Z | Identified the second architectural layer: scope correctness. ~80% of v1.0.5 residual noise was content-type confusion (CI brackets, table cells, code blocks), not slice-axis confusion. Added scope='narrative' filter. |
| 2026-05-26T20:11Z | v1.1.0 ships: BindingKey + slice-aware matching + scope='narrative' (closes #80). |

End-to-end cycle: #80 filing (18:15Z) → v1.1.0 ship (20:11Z) = ~2 hours compressed cycle. Four /exploring-options rounds + one user-driven architectural redirect (“ultrathink on what the right long-term solution is”) produced a coherent two-layer fix instead of the issue’s narrower literal ask.

The two-layer correctness lesson (pending ADR 0005)#

Audit validators have two correctness layers, not one:

  1. Identity correctness — canonical measurements have structured identity (frozen dataclass with named fields), not positional tuples. Future identity axes (split, ci_kind, source_ref, …) added as defaulted fields without breaking the dict-key schema. Positional tuples lock the validator into a recur-every-N-months schema-event pattern.

  2. Scope correctness — the validator should only scan content plausibly a binding claim. Narrative prose sentences are. Markdown table cells aren’t (they’re structured data, audited differently). Bracketed expressions ([CI 0.286, 0.301]) aren’t (inline expressions, not point estimates). Code blocks aren’t.

This mirrors lint-design conventions from ruff/mypy/bandit (scope predicates like # noqa/# nosec/# type: ignore are first-class, not optional).
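The two layers can be sketched as follows. The BindingKey field names (detector, metric, slice) come from this ledger; everything else, including normalize_key, narrative_only, the default of None for slice, the slice labels, and the exact filtering rules, is a hypothetical illustration rather than the shipped implementation.

```python
import re
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class BindingKey:
    """Layer 1 sketch: structured identity with named, defaulted fields."""
    detector: str
    metric: str
    slice: Optional[str] = None  # future axes could be added as further defaulted fields


def normalize_key(key: Union[tuple, BindingKey]) -> BindingKey:
    """Hypothetical multi-shape adapter: accept 2-tuples, 3-tuples, or BindingKey."""
    if isinstance(key, BindingKey):
        return key
    if isinstance(key, tuple) and len(key) in (2, 3):
        return BindingKey(*key)
    raise TypeError(f"unsupported binding key shape: {key!r}")


FENCE = "`" * 3  # markdown code-fence marker


def narrative_only(markdown: str) -> str:
    """Layer 2 sketch: keep prose; drop fenced code and table rows; blank [bracketed] spans."""
    kept = []
    in_fence = False
    for line in markdown.splitlines():
        stripped = line.lstrip()
        if stripped.startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence or stripped.startswith("|"):
            continue
        kept.append(re.sub(r"\[[^\]]*\]", "", line))
    return "\n".join(kept)


# Mixed legacy and structured keys; the slice labels are made up for the example.
bindings = {
    ("TF-IDF + LR", "AUPRC"): 0.971,
    ("LoRA", "AUPRC", "val"): 0.974,
    BindingKey("frozen probe", "AUROC", slice="ood"): 0.515,
}
normalized = {normalize_key(k): v for k, v in bindings.items()}

doc = "\n".join([
    "TF-IDF reaches 0.971 AUPRC [CI 0.286, 0.301].",
    "| LoRA | 0.974 |",
    FENCE,
    "x = 0.974",
    FENCE,
    "Done.",
])
filtered = narrative_only(doc)
```

A frozen dataclass is hashable, so it can serve as a dict key in the same way the legacy tuples did, while adding a new defaulted field leaves existing keys valid.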

Dogfood evidence#

| Configuration | Warnings on consumer HEAD | Reduction vs v1.0.5 baseline |
| --- | --- | --- |
| v1.0.5 (legacy 2-tuple) | 95 | (baseline) |
| v1.1.0 BindingKey + slice-fix only (scope='all', default slice_window_chars=120) | 74 | -22% |
| v1.1.0 BindingKey + slice-fix + scope='narrative' (recommended) | 23 | -76% |

The first instinct — “just add a slice axis like the issue asks” — would have left the validator practically unusable on dense academic prose, requiring a separate v1.2.0 cleanup release. The deeper rethink (triggered by the user’s redirect) delivered both layers in a single coherent v1.1.0.

Residual 23 warnings are pre-existing positional-heuristic limitations (sentence-boundary unawareness, multi-detector list parsing on dense prose) not addressable without parser-level work; deferred to future v1.2.0+ as a narrower issue class.

Round 12 outcome#

  • v1.1.0 shipped (commit 911565b, PyPI live).

  • #80 auto-closed by closes #80 directive on the v1.1.0 tag.

  • 0 open issues on the eval-toolkit repo post-#80.

  • Architecture lesson codified in pending ADR 0005: the identity + scope two-layer rule applies to future audit validators.

  • Memory entry: feedback_validator_identity_plus_scope captures the “issue body says X is the problem; investigate the dominant noise source before agreeing” discipline.

Consumer adoption + HARD-gate viability#

Consumer (prompt-injection-detection-submission) currently pins eval-toolkit>=1.0.3,<2. Adoption path for v1.1.0:

  • Smallest diff: replace 2-tuple BINDINGS literal with 3-tuple keys (issue body’s proposal; works directly).

  • Recommended: migrate to BindingKey(detector=..., metric=..., slice=...) for forward-extensibility.

  • Add scope="narrative" to the validator call.

  • Re-run; expect ~76% noise reduction.

HARD-gate promotion at consumer’s v1.3.10+ becomes credible at the ~80% reduction level. Remaining residual false positives can be suppressed via consumer-side filtering (excluding lines containing “random floor” or “versus”) or accepted as known low-frequency noise pending v1.2.0+ pairing-rule improvements.

Multi-LLM audit cadence after R12#

Same as R11: no multi-LLM cross-review for consumer-feedback rounds; reserved for v2.0 design cycles or major-severity findings that single-LLM verification cannot resolve. The R12 fix was informed by the user-driven /exploring-options discipline (4 rounds + a structural redirect), which substitutes for multi-LLM dispatch for this class of architectural decision.


Round 13 (2026-05-26) — audit_value_bindings v1.2.0 context-aware noise reduction#

Not a multi-LLM gate-style audit. Third consumer-feedback-driven round, immediately following R12. R12 closed the slice-axis schema-extensibility gap (BindingKey + scope='narrative' content-type filter) and reduced the consumer’s noise floor from 96 → 36 warnings (62%). R13 addresses the residual 36 (the positional-heuristic limitations ADR 0005 named as “Future work (deferred)”) via four context-aware extensions to scope='narrative' shipped as v1.2.0.

Motivating evidence (consumer-side, post-v1.3.11)#

Consumer (prompt-injection-detection-submission@v1.3.11) ran the v1.1.0 audit_value_bindings against their writeup and produced 36 residual false positives. Consumer-side categorization in decisions/upstream_issues.md:92:

| Category | Count | Discriminating prose pattern |
| --- | --- | --- |
| Random-floor / sub-clause mentions | ~10 | “random floor of 0.374”, “baseline of 0.374” |
| Delta with sign / keyword | ~9 | “-0.071 AUPRC”, “drops -0.132 below” |
| Multi-detector “vs”/“versus” enumeration | ~8 | “AUPRC 0.556 vs 0.519” (same binding double-flagged) |
| Sentence-boundary cross-pairing | ~5 | “X scored 0.291. The floor is 0.374” |
| Genuinely ambiguous (cross-detector list constructions) | ~4 | “0.293 versus 0.364 for the frozen probe and 0.291 for TF-IDF” |

The first four categories share a structural property: they’re narrative-prose context cues that the v1.1.0 positional heuristic can’t read. ADR 0005’s “Future work (deferred)” section named them explicitly. The fifth category (cross-detector list grammar) is a deeper parser-level problem.

Upstream design — /exploring-options 4 rounds + ultrathink redirect#

| Time (UTC) | Event |
| --- | --- |
| ~20:30Z | /exploring-options Round 1: which categories to address in v1.2.0? Decision: Full T1+T2+T3+T4 over the Explore agent’s recommended T1+T2-only. T3 and T4 are cheap (~70 LOC combined) and address ADR 0005-named deferred work. |
| ~20:45Z | /exploring-options Round 2: API placement. Decision: all four filters bundled under scope='narrative'; backward compat for scope='all' preserved exactly. |
| ~21:00Z | /exploring-options Round 3: sentence-boundary detection rigor. Decision: paragraph-aware abbreviation guard (vs./e.g./i.e./etc./cf./fig./eq./pp./viz./ca. excluded; single \n soft, \n\n hard; decimal + letter-dot-letter patterns guarded). Rejected naive α/β (over-splits multi-line academic prose); rejected γ library-based (nltk/spacy too heavy). |
| ~21:10Z | /exploring-options Round 4: keyword list configurability. Decision: hardcoded module-level frozenset constants (_DELTA_KEYWORDS, _FLOOR_KEYWORDS). No new public kwargs. |
| ~21:20Z | v1.2.0 ships: commit 40b3741, PyPI live, GH release published, Verify PyPI receipt step ✓. End-to-end from R12 closure to v1.2.0 ship: ~1 hour. |
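The Round 3 decision (abbreviation guard, hard paragraph breaks, decimals left alone) can be approximated with the sketch below. It is a simplified stand-in written for this ledger, not the v1.2.0 implementation; the function names, the regexes, and the way the abbreviation list is stored are all assumptions.

```python
import re
from typing import List

# Abbreviations from the decision above, stored without their trailing dot.
ABBREVIATIONS = frozenset(
    {"vs", "e.g", "i.e", "etc", "cf", "fig", "eq", "pp", "viz", "ca"}
)


def sentence_boundaries(text: str) -> List[int]:
    """Return offsets just past each detected sentence end."""
    bounds = set()
    # Hard boundary: a blank line separates paragraphs.
    for m in re.finditer(r"\n\s*\n", text):
        bounds.add(m.end())
    # Candidate boundary: terminal punctuation followed by whitespace.
    # A decimal point ("0.291") is followed by a digit, so it never matches.
    for m in re.finditer(r"[.!?](?=\s)", text):
        if m.group() == ".":
            word = re.search(r"([A-Za-z.]+)$", text[: m.start()])
            if word and word.group(1).lower() in ABBREVIATIONS:
                continue  # "vs." / "e.g." do not end a sentence here
        bounds.add(m.end())
    return sorted(bounds)


def crosses_boundary(text: str, start: int, end: int) -> bool:
    """True if any sentence boundary lies between the two offsets."""
    return any(start < b <= end for b in sentence_boundaries(text))


split_case = "X scored 0.291. The floor is 0.374"
same_sentence = "AUPRC 0.556 vs. 0.519 on val"

split_crosses = crosses_boundary(
    split_case, split_case.index("0.291"), split_case.index("0.374")
)
same_crosses = crosses_boundary(
    same_sentence, same_sentence.index("0.556"), same_sentence.index("0.519")
)
```

One known limitation of this simplification: a sentence that genuinely ends in "etc." is not split, which errs on the side of under-splitting.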

Dogfood evidence (compounded across the cycle)#

| Release | Configuration | Warnings on consumer HEAD | Reduction vs v1.0.5 |
| --- | --- | --- | --- |
| v1.0.5 | Legacy 2-tuple, no scope filter | 95 | (baseline) |
| v1.1.0 | BindingKey + scope='narrative' content-type filter | 23 | -76% |
| v1.2.0 | + T1–T4 context filters | 7 | -93% |

The 7 v1.2.0 residuals are all in the “cross-detector list construction” category — prose where the validator can’t infer that subsequent values belong to OTHER detectors via list connectives (“and”, “for X”, “vs”). T3 only deduplicates the SAME binding within one sentence; cross-detector list inference is a parser-level problem.

Tactical refinements during implementation#

Three calibration adjustments emerged from running the test suite + dogfood:

  1. _FLOOR_KEYWORDS narrowed — initial draft included “baseline”, “prior”, “majority”; removed after the existing test_detector_with_no_nearby_value_skipped failed on prose "The TF-IDF baseline performs well" (where “baseline” was a detector descriptor, not a floor reference). Replacement coverage for the legitimate floor cases is provided by T1’s “below” keyword (e.g., “below the prevalence baseline of 0.374” is caught by “below”).

  2. T1 keyword window changed to before-only — initial draft used symmetric ±30 chars; the combined-dogfood test failed on "frozen probe's 0.515 (delta -0.132)" because “delta” (AFTER 0.515) suppressed 0.515. Before-only window correctly fires T1 on -0.132 (preceded by “delta”) and leaves 0.515 alone.

  3. _DELTA_KEYWORDS calibrated: dropped “against” (suppressed legitimate “LoRA’s AUROC is 0.383 against frozen probe’s 0.515” matches). Kept “vs”/“versus” (canonical delta separator). Excluded “above”/“ahead”/“behind” (too ambiguous).

These calibrations are captured in the keyword-list comments in audit_value_bindings.py so future maintainers see the rationale.
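The second calibration (symmetric window versus before-only window) can be reproduced on the exact prose from the failing test. The keyword set below is an illustrative subset, and is_delta_value is a hypothetical helper; neither is copied from audit_value_bindings.py.

```python
import re

# Illustrative subset only; the real _DELTA_KEYWORDS contents are not reproduced here.
DELTA_KEYWORDS = frozenset({"delta", "below", "vs", "versus"})
WINDOW = 30  # characters inspected relative to the value


def is_delta_value(text: str, value_start: int, symmetric: bool = False) -> bool:
    """True if a delta keyword appears in the window before (or around) the value."""
    lo = max(0, value_start - WINDOW)
    hi = value_start + WINDOW if symmetric else value_start
    words = re.findall(r"[a-z]+", text[lo:hi].lower())
    return any(w in DELTA_KEYWORDS for w in words)


prose = "frozen probe's 0.515 (delta -0.132)"
pos_point = prose.index("0.515")
pos_delta = prose.index("-0.132")

# Symmetric draft: "delta" appears AFTER 0.515 and wrongly suppresses it.
point_suppressed_symmetric = is_delta_value(prose, pos_point, symmetric=True)

# Before-only window: 0.515 is kept, -0.132 (preceded by "delta") is suppressed.
point_suppressed = is_delta_value(prose, pos_point)
delta_suppressed = is_delta_value(prose, pos_delta)
```

The before-only window encodes the reading direction of English prose: a keyword qualifies the number that follows it, not the one that precedes it.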

Round 13 outcome#

  • v1.2.0 shipped (commit 40b3741, PyPI live).

  • ADR 0005 amended: “Future work (deferred)” section notes the v1.2.0 partial closure; sentence-boundary + 3 related context-aware filters resolved; cross-detector list grammar remains v1.3.0+ territory.

  • Consumer adoption path clear: re-pin eval-toolkit>=1.2.0,<2; HARD-gate promotion of audit_value_bindings becomes credible (7 residual < 10 actionable threshold).

  • Memory entry updated: feedback_validator_identity_plus_scope extended with the v1.2.0 sequel — the same “identity + scope” framing accommodates the new context-aware filters as scope extensions, vindicating the original two-layer architecture decision.

Multi-LLM audit cadence after R13#

Unchanged from R11/R12: no multi-LLM cross-review for consumer-feedback rounds. The R13 design was informed by the user-driven /exploring-options discipline (4 rounds + tactical calibration during implementation), substituting for multi-LLM dispatch for this class of incremental refinement. Future Round 14+ would re-engage multi-LLM dispatch only if a v2.0 design cycle opens or a major-severity consumer finding requires it.

Cross-references#

  • ADR 0005 — “Future work (deferred)” section now reflects v1.2.0 closure status.

  • v1.2.0 CHANGELOG entry — full per-filter detail (T1–T4 keyword lists, window calibration, dogfood numbers).

  • tests/test_audit_value_bindings.py — 36 tests (28 from v1.1.0 + 8 new for T1–T4 + sentence-boundary unit test).


Round 14 (2026-05-26) — Layer 3 pairing rules close cross-detector list-grammar (#81)#

Not a multi-LLM gate-style audit. Fourth consumer-feedback-driven round in the post-v1.0 cycle (R11 → R12 → R13 → R14). The v1.2.0 cycle (Round 13) reduced consumer noise from 96 → 4 warnings via the T1–T4 context-aware narrative filters; the residual 4 were all cross-detector list-grammar / metric-axis confusion cases that the two-layer identity-+-scope model couldn’t resolve. Consumer filed #81 documenting the 3 prose patterns + proposing a _LIST_CONNECTIVES extension; upstream v1.3.0 ships Layer 3 (pairing rules) per ADR 0006 as the architectural closure.

Motivating residuals from v1.2.0 dogfood#

Consumer’s v1.3.12 dogfood (eval-toolkit v1.2.0 adopted; 36 warnings → narrowed by their additional SKIP_PATTERNS to 4):

| File:Line | Pattern category | Prose snippet |
| --- | --- | --- |
| WRITEUP_PAPER.md:304 | “for X” postfix (Pattern A) | "versus 0.364 [...] for the frozen probe and 0.291 [...] for TF-IDF + LR" |
| RESULTS.md:171 (×2) | possessive + metric confusion (Patterns B + D) | "LoRA's pooled OOD AUROC is 0.383 against frozen probe's 0.515" |
| README.md:71 | group subject (Pattern C) | "0.38 AUROC, ~0.6 drop for the trained detectors" |

Upstream design — /exploring-options 2 rounds + Pattern D emerged during dogfood#

| Time (UTC) | Event |
| --- | --- |
| ~21:30Z | Consumer files #81 documenting the 3 prose patterns (A/B/C) + proposing _LIST_CONNECTIVES design. |
| ~22:00Z | /exploring-options Round 1: scope. Decision: A + B + C-suppress (not C-inference); closes all 4 residuals at ~130 LOC; rejected A+B-only (leaves 1) and A+B+C-inference (~250 LOC, MODERATE-HIGH risk). |
| ~22:10Z | /exploring-options Round 2: documentation. Decision: new ADR 0006 codifying Layer 3 (pairing rules) as the third correctness layer alongside ADR 0005’s identity + scope. |
| ~22:30Z | Implementation reaches first dogfood: 4 → 2 warnings. Pattern D emerges: the residual 2 are metric-axis confusion ("AUPRC delta suggests: AUROC is 0.383"; wrong metric picked up by proximity). Added as fourth rule (symmetric to detector-axis pairing). |
| ~22:45Z | Three tactical calibrations: (a) Pattern A intervening-value check now uses v1.1.0’s excluded_ranges (CI brackets don’t count as intervening); (b) Pattern B refactored to “last possessive within 30 chars” instead of requiring an “is” verb; (c) Pattern C uses v1.2.0’s _crosses_sentence_boundary to prevent paragraph-crossing suppression bleed. |
| ~22:55Z | Override-bypass architecture: Pattern A/B now set pairing_confirmed_pos to BYPASS the proximity-based detector pairing when they confirm THIS binding’s detector (fixed the bug where override + proximity disagreed). |
| ~23:00Z | v1.3.0 ships: commit 4c2742e. Consumer dogfood: 4 → 0 warnings. Combined v1.0.5 → v1.3.0: 95 → 0 (-100%). |

End-to-end from #81 filing to v1.3.0 ship: ~1.5 hours.

Layer 3 architecture (per ADR 0006)#

Audit validators now have a three-layer correctness model:

| Layer | Dimension | Mechanism | Release |
| --- | --- | --- | --- |
| 1 | Identity | BindingKey structured keys | v1.1.0 |
| 2 | Scope | Content-type + context-keyword filters (scope='narrative') | v1.1.0 + v1.2.0 |
| 3 | Pairing | Override / suppress proximity-based pairing under grammar cues | v1.3.0 |

Four Layer 3 rules ship in v1.3.0 (all under scope='narrative', no new public kwargs):

  • Pattern A: "for {detector}" postfix → override detector pairing (with intervening-value guard via v1.1.0 exclusion-ranges).

  • Pattern B: "{detector}'s" possessive → override detector pairing (last possessive within 30 chars of value).

  • Pattern C: "for the {trained|frozen|baseline|all|both|other} detectors" group subject → suppress candidate (with sentence-boundary guard via v1.2.0 sentence-positions).

  • Pattern D: metric-axis nearest-pairing (symmetric to detector-axis). Pre-collects ALL metric positions across consumer-supplied metric_aliases keys, not just binding-derived metrics.
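Patterns B and C above can be sketched with plain string and regex checks. The following is a minimal illustration, not the shipped validator: `DETECTORS`, `possessive_detector`, `suppressed_by_group_subject`, and the simple punctuation-based sentence check are assumptions made for this example. The 30-character window and the group-subject word list are taken from the rule descriptions.

```python
import re
from typing import Optional

# Hypothetical detector names; the real validator derives these from bindings.
DETECTORS = ("LoRA", "frozen probe")
POSSESSIVE_WINDOW = 30  # Pattern B: last possessive within 30 chars of value

GROUP_SUBJECT = re.compile(
    r"for the (?:trained|frozen|baseline|all|both|other) detectors"
)


def possessive_detector(text: str, value_pos: int) -> Optional[str]:
    """Pattern B sketch: detector named by the last possessive before the value."""
    window = text[max(0, value_pos - POSSESSIVE_WINDOW):value_pos]
    best, best_pos = None, -1
    for name in DETECTORS:
        pos = window.rfind(f"{name}'s")
        if pos > best_pos:
            best, best_pos = name, pos
    return best


def suppressed_by_group_subject(text: str, value_pos: int) -> bool:
    """Pattern C sketch: suppress when a group subject shares the value's sentence."""
    for m in GROUP_SUBJECT.finditer(text):
        lo, hi = sorted((value_pos, m.start()))
        # Assumed boundary test: sentence punctuation followed by whitespace.
        if not re.search(r"[.!?]\s", text[lo:hi]):
            return True
    return False


pair = "LoRA's pooled OOD AUROC is 0.383 against frozen probe's 0.515"
print(possessive_detector(pair, pair.index("0.383")))  # LoRA
print(possessive_detector(pair, pair.index("0.515")))  # frozen probe

group = "0.38 AUROC, ~0.6 drop for the trained detectors"
print(suppressed_by_group_subject(group, group.index("0.38")))  # True
```

The two example sentences are the ones quoted in the #81 pattern table. The decimal points inside values do not trip the boundary check because they are not followed by whitespace.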

Dogfood evidence (compounded across the four rounds)#

| Release | Configuration | Warnings on consumer HEAD | Reduction vs v1.0.5 |
|---|---|---|---|
| v1.0.5 | Legacy 2-tuple, no scope filter | 95 | (baseline) |
| v1.1.0 | BindingKey + scope='narrative' content-type | 23 | -76% |
| v1.2.0 | + T1–T4 context filters | 7 | -93% |
| v1.3.0 | + Patterns A/B/C/D pairing rules | 0 | -100% |
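The reduction column follows directly from the warning counts. A short check, using only the figures in the table (the helper name `reduction_pct` is introduced here for illustration):

```python
BASELINE = 95  # warnings on consumer HEAD at v1.0.5

warnings_by_release = {"v1.1.0": 23, "v1.2.0": 7, "v1.3.0": 0}


def reduction_pct(baseline: int, current: int) -> int:
    """Percentage reduction relative to the baseline, rounded to an integer."""
    return round(100 * (baseline - current) / baseline)


reductions = {
    release: reduction_pct(BASELINE, count)
    for release, count in warnings_by_release.items()
}
for release, pct in reductions.items():
    print(f"{release}: -{pct}%")
# v1.1.0: -76%
# v1.2.0: -93%
# v1.3.0: -100%
```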

Round 14 outcome#

  • v1.3.0 shipped (commit 4c2742e, PyPI live).

  • #81 auto-closed by closes #81 directive on the v1.3.0 tag.

  • 0 open issues on the eval-toolkit repo post-#81.

  • ADR 0006 committed — Layer 3 pairing rules formally codified as the third correctness layer.

  • Consumer HARD-gate promotion now credible. With 0 residual warnings, the consumer can promote audit_value_bindings from SOFT to HARD bundled with audit_citation_alignment per their v1.3.8 plan.

The post-v1.0 consumer-feedback cycle (R11 → R14)#

The four post-v1.0 rounds form a complete consumer-feedback loop:

| Round | Driver | Cycle time | Closure |
|---|---|---|---|
| R11 | Consumer adopts v1.0.x audit-validator family | days | v1.0.4 (3 validators shipped) |
| R12 | Consumer files #80 (BINDINGS slice-axis) | ~2 hours | v1.1.0 + v1.2.0 (BindingKey + scope='narrative') |
| R13 | v1.1.0 dogfood surfaces context-filter gaps | ~1 hour | v1.2.0 (T1–T4 context filters; 23 → 7) |
| R14 | Consumer files #81 (cross-detector list-grammar) | ~1.5 hours | v1.3.0 (Layer 3 pairing rules; 4 → 0) |

The pattern is consistent: consumer surfaces a structural gap → upstream files an ADR-driven design → ships within hours → consumer adopts same-day. The 100% noise reduction over four rounds validates the iterative consumer-feedback model as a substitute for heavyweight multi-LLM gate audits (which closed at v1.0 per ADR 0003).

Multi-LLM audit cadence after R14#

Unchanged from R11/R12/R13: no multi-LLM cross-review for consumer-feedback rounds. R14’s design was informed by /exploring-options (2 rounds) + Pattern D emergence during dogfood + 3 tactical calibrations during implementation. Multi-LLM dispatch remains reserved for v2.0 design cycles or major-severity findings.

Cross-references#

  • ADR 0006 — Layer 3 codification.

  • v1.3.0 CHANGELOG entry — full per-pattern detail with consumer adoption path.

  • tests/test_audit_value_bindings.py — 43 tests (36 from v1.2.0 + 7 for Patterns A/B/C/D + backward-compat + combined dogfood).


Round 15 (2026-05-26) — audit_citation_alignment Layer 2 + Layer 3 + ADR 0007 family-wide architecture#

Not a multi-LLM gate-style audit. Fifth consumer-feedback-driven round, opening the equivalent cycle for audit_citation_alignment that R11→R14 closed for audit_value_bindings. Consumer filed eval-toolkit#82 concurrent with their v1.3.13 adoption: 188 residual warnings on audit_citation_alignment, same architectural-class gap (Layer 2 + Layer 3 context-awareness) that audit_value_bindings worked through across three releases.

Architectural insight: ADR 0007 generalizes the model#

ADR 0005 (Layer 1 + 2) and ADR 0006 (Layer 3) were originally framed as audit_value_bindings-specific. The v1.4.0 cycle adopting the same architecture for audit_citation_alignment revealed that the three-layer model is the canonical architecture for the audit-validator family — not just one validator. ADR 0007 codifies this family-wide.

The v1.4.0 release also extracted the shared narrative-prose helpers into a private flat module, eval_toolkit/_narrative.py (consistent with ADR 0001’s _rng.py/_parallel.py/_sweep.py precedent). Both validators import from this module; future audit validators inherit the same primitives.

Three-layer model applied to audit_citation_alignment#

| Pattern | Layer | Mechanism |
|---|---|---|
| β | Layer 2 | Exclude citations inside markdown table rows, bracketed expressions, fenced code blocks (reuses v1.1.0 _build_exclusion_ranges). |
| γ | Layer 3 | Category-keyword extraction window bounded by the sentence containing the citation (reuses v1.2.0 _sentence_boundary_positions). |
| α | Layer 3 | Multi-category set membership: when the sentence matches multiple category keywords (multi-topic prose), accept the citation if the ADR’s actual category is in the set. |
| None-skip | Layer 2 refinement | Defer (skip the citation) when subject.category is None, i.e. the consumer’s category map can’t classify the ADR. Symmetric to the existing claim_category is None skip from v1.0.1. |
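A compact sketch of how Patterns γ and α and the None-skip could combine for one citation. Everything named here (`CATEGORY_KEYWORDS`, `ADR_CATEGORY`, `containing_sentence`, `check_citation`, and the example ADR ids other than ADR-025) is hypothetical and stands in for the consumer's category map and the library's private helpers. The punctuation-based sentence split is a simplification, and single-topic sentences are treated as a one-element set rather than going through a separate legacy path.

```python
import re

# Hypothetical consumer-side maps, for illustration only.
CATEGORY_KEYWORDS = {
    "threshold": ("threshold", "operating point"),
    "cost": ("cost", "budget"),
}
ADR_CATEGORY = {"ADR-025": "cost", "ADR-031": "threshold", "ADR-040": None}


def containing_sentence(text: str, pos: int) -> str:
    """Pattern γ sketch: return the sentence that contains position `pos`."""
    start = 0
    for m in re.finditer(r"[.!?]\s+", text):
        if m.end() <= pos:
            start = m.end()          # boundary before the citation
        else:
            return text[start:m.end()]  # first boundary after the citation
    return text[start:]


def check_citation(text: str, adr_id: str, pos: int) -> str:
    """Return 'ok', 'warn', or 'skip' for one citation."""
    actual = ADR_CATEGORY.get(adr_id)
    if actual is None:
        return "skip"  # None-skip: the category map cannot classify the ADR
    sentence = containing_sentence(text, pos).lower()
    claimed = {
        category
        for category, keywords in CATEGORY_KEYWORDS.items()
        if any(keyword in sentence for keyword in keywords)
    }
    if not claimed:
        return "skip"  # no claim category found in the sentence
    # Pattern α sketch: accept when the ADR's category is among those claimed.
    return "ok" if actual in claimed else "warn"


multi = ("The budget is fixed. We tune the threshold and the cost cap "
         "together (ADR-025). See ADR-040 for details.")
single = "The threshold is 0.5 per ADR-025."

print(check_citation(multi, "ADR-025", multi.index("ADR-025")))    # ok
print(check_citation(multi, "ADR-040", multi.index("ADR-040")))    # skip
print(check_citation(single, "ADR-025", single.index("ADR-025")))  # warn
```

The last case mirrors the residual class described for this round: a cost ADR cited for a threshold claim still warns, since only one category matches the sentence.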

Upstream design — /exploring-options 3 rounds + Pattern α generalization + None-skip emerged during dogfood#

| Time (UTC) | Event |
|---|---|
| ~21:30Z | Consumer files #82 documenting 3 prose patterns (α/β/γ) + 188-warning baseline. |
| ~22:00Z | /exploring-options Round 1: scope. Decision: bundle Path A + Path B in v1.4.0 (one coherent release; patterns established). |
| ~22:10Z | /exploring-options Round 2: helper sharing. Decision: extract to private flat module _narrative.py. |
| ~22:20Z | /exploring-options Round 3: documentation. Decision: write ADR 0007 generalizing three-layer model to family. |
| ~22:30Z | Implementation reaches first dogfood: 188 → 101 (46% reduction). Pattern β + γ + initial α (multi-citation only) firing. |
| ~22:40Z | Symmetric-None skip emerges: dominant residual is actual=None cases (ADRs the consumer’s category map can’t classify). Symmetric to existing claim=None skip from v1.0.1. Added under scope='narrative' → 101 → 44. |
| ~22:50Z | Pattern α generalized: extend from “multi-citation sentences only” to “any multi-topic sentence” (any sentence matching ≥2 category keywords). 44 → 37. |
| ~22:55Z | Decision: ship at 37 rather than over-engineer. Above the #82 ≤20 target but a 5× reduction; residual 37 includes real misalignments worth consumer-triage + edge cases requiring parser-level understanding. |
| ~23:00Z | v1.4.0 ships (commit 7c40490). ADR 0007 codifies family-wide architecture. |

End-to-end from #82 filing to v1.4.0 ship: ~1.5 hours.

Dogfood evidence (audit_citation_alignment cycle)#

| Release | Configuration | Warnings on consumer HEAD | Reduction vs v1.3.0 baseline |
|---|---|---|---|
| v1.3.0 | audit_citation_alignment with scope='all' | 188 | (baseline) |
| v1.4.0 | scope='narrative' | 37 | -80% |

The residual 37 is above the original #82 acceptance criterion (≤20) but represents a 5× reduction. Categorized:

  • Real misalignments consumer should triage (e.g., ADR-025 cited for a threshold claim when ADR-025 is the cost ADR — could be a wrong-ADR bug or a multi-topic ADR not captured by the consumer’s category-keyword map).

  • Single-topic-sentence edge cases: Pattern α only fires for multi-topic sentences (≥2 category keywords matched). Sentences with one dominant topic + a cross-category ADR cite stay on the legacy first-match check.

  • Multi-topic ADRs where the consumer’s category map assigns a single category but the ADR genuinely covers multiple topics. Consumer-side category-map expansion would address these.

Round 15 outcome#

  • v1.4.0 shipped (commit 7c40490, PyPI live).

  • #82 auto-closed by closes #82 directive on the v1.4.0 tag.

  • 0 open issues on the eval-toolkit repo post-#82.

  • ADR 0007 committed — three-layer architecture (identity + scope + pairing) codified as canonical for ALL audit_* validators. ADRs 0005/0006 now framed as family-wide via 0007.

  • _narrative.py extracted — shared narrative-prose helpers; both validators import from one canonical location. Signature-preserving refactor; all v1.3.0 tests pass unchanged.

  • Consumer HARD-gate promotion: still a judgment call. The residual 37 includes some real misalignments worth fixing in consumer prose; HARD-gating now would block commits on those lines. Consumer’s v1.3.8 bundled-promotion plan should proceed only after triaging the 37.

The post-v1.0 consumer-feedback cycle (R11 → R15)#

The five rounds form a complete library-first cycle for the audit-validator family:

| Round | Driver | Cycle time | Closure |
|---|---|---|---|
| R11 | Consumer adopts v1.0.x audit-validator family | days | v1.0.4 (3 validators shipped) |
| R12 | Consumer files #80 (BINDINGS slice-axis) | ~2h | v1.1.0 + v1.2.0 |
| R13 | v1.1.0 dogfood surfaces context-filter gaps | ~1h | v1.2.0 |
| R14 | Consumer files #81 (audit_value_bindings list-grammar) | ~1.5h | v1.3.0 (Layer 3) |
| R15 (this) | Consumer files #82 (audit_citation_alignment scope+pairing) | ~1.5h | v1.4.0 (Layer 2 + 3 + ADR 0007) |

R11→R15 demonstrates the iterative consumer-feedback model as a substitute for heavyweight multi-LLM gate audits (closed at v1.0 per ADR 0003). Each round delivered closure within hours of the consumer filing, with same-day adoption. ADR 0007 captures the architectural template; future audit validators inherit the three-layer model and ship in fewer cycles.

Multi-LLM audit cadence after R15#

Unchanged from R11–R14: no multi-LLM cross-review for consumer-feedback rounds. R15’s design was informed by /exploring-options (3 rounds) + 2 dogfood-driven refinements (Pattern α generalization + symmetric-None skip). Multi-LLM dispatch remains reserved for v2.0 design cycles or major-severity findings.

Cross-references#

  • ADR 0007 — family-wide three-layer codification.

  • v1.4.0 CHANGELOG entry — full per-pattern detail with consumer adoption path.

  • tests/test_audit_citation_alignment.py — 18 tests (12 from v1.0.1 + 6 for v1.4.0 Patterns α/β/γ + backward-compat + shared-helpers + combined dogfood).