Audit findings ledger
This ledger tracks findings from each cross-model methodology audit (Gate 3 per ADR 0003 — to be drafted at v0.48). Each row records the finding ID, severity, finding summary, disposition (how it was resolved or scheduled), and a link to the tracked issue (where applicable).
Convention: blocker-severity findings get a p1-gate3-labelled GitHub issue
for fix-tracking. Lower-severity findings are recorded here only.
Cross-references:
The audit prompt template is at gate3-audit-prompt.md (local — not in published docs).
The v1.0 plan that drives audit cadence is at ~/.claude/plans/evaluate-all-the-work-twinkly-kite.md (local).
Audit re-run schedule: after each breaking minor (v0.46, v0.47, v0.48) plus the original Round 5 pre-implementation pass. 7-day audit-completion timeout per gate.
Round 5 (2026-05-21) — Codex + Gemini pre-implementation audit
Reviewers: author (manual) + Codex (independent report) + Gemini (independent report).
Packet: v0.44.0 code state + the v1.0 release plan
(~/.claude/plans/evaluate-all-the-work-twinkly-kite.md) +
docs/source/methodology/ (16 chapters) + docs/source/roadmap.md +
CHANGELOG.md + src/eval_toolkit/*.py + existing migration guides
(migration/v0.7.md, v0.8.md, v0.9.md).
Audit prompt:
~/.claude/plans/gate3-audit-prompt.md.
| ID | Severity | Finding | Disposition | Issue |
|---|---|---|---|---|
| R5-F1 | blocker before v0.46 |  | Decision R: drop F1/accuracy/precision/recall from v0.46 specs; keep | (in plan) |
| R5-F2 | blocker before v0.46 |  | Decision S: reuse | (in plan) |
| R5-F3 | blocker before v0.47 | Sweep unification plan assumes | Plan revised: create 3 preprocessing dataclasses at v0.47 as part of sweep work; sweep contract clarified (neutral default; explicit | (in plan) |
| R5-F4 | blocker before v0.46 | Plan’s scalar-deprecation directive would replace the lazy export resolver ( | Plan corrected: extend existing | (in plan) |
| R5-F5 | blocker before v1.0 | DeLong ( | Decision U: keep public; update | (in plan) |
| R5-F6a | packet drift |  | Docstring fix bundled into v0.48 §5E-prep. Code is correct per Bayle (2020) Thm 3.1; docstring oversells | (in plan) |
| R5-F6b | packet drift |  | Doc fix bundled into v0.48 §5E-prep. Also clarify | (in plan) |
| R5-F6c | packet drift |  | Doc fix bundled into v0.48 §5E-prep | (in plan) |
| R5-F6d | packet drift |  | Doc fix bundled into v0.48 §5E-prep | (in plan) |
| R5-F6e | packet drift + code |  | Doc fix + defensive code guard bundled into v0.48 §5E-prep: rewrite example with disjoint split + raise | (in plan) |
| R5-F6f | partial verified |  | Docstring sharpening (add | (in plan) |
| R5-F7 | governance risk | Gate 3 redefinition (multi-model cross-review) is useful but not the same evidence class as external academic peer review; the plan understated this | Decision O revised: ADR 0003 explicitly states Gate 3 at v1.0 is internal model-assisted cross-review, NOT external academic peer review; documents what it catches and doesn’t | (in plan) |
| R5-Px | plan internal inconsistency | v0.46 verification smoke imports | Plan smoke test corrected — | (in plan) |
Round 5 totals: 12 findings (5 blockers, 1 high-pre-v1, 5 packet drift, 1 governance, 1 plan inconsistency). All resolved or scheduled before any v0.46 code shipped.
Key follow-on decisions (driven by Round 5):
Decision W: packet-drift fixes (F6a–F6f) bundle into v0.48 polish (not earlier hotfix).
Decision X: scorecard skipped-status detection reuses is_metric_defined_for_slice from v0.39.0.
Decision Y: Round 6 / 7 / 8 audits run as formal stop-gates between v0.46 / v0.47 / v0.48 / v1.0; 7-day timeout per gate.
Decision Z: this ledger file.
Round 6 (complete: 2026-05-21 — Codex + Gemini + manual; 11 findings)
Ship date: v0.46.0 tagged + published to PyPI 2026-05-21. STOP-GATE per
Decision Y.2 — release/v0.47.0 opens once R6-F1 (BLOCKER) + R6-F2 (HIGH)
are fixed via v0.46.1 hotfix and the remaining items are scheduled.
Reviewers: author (manual) + Codex (independent) + Gemini (independent).
Packet to send to Codex + Gemini:
This plan file (~/.claude/plans/evaluate-all-the-work-twinkly-kite.md)
This audit ledger (docs/source/audit_findings.md)
docs/source/methodology/ (16 chapters — unchanged from Round 5)
docs/source/roadmap.md
docs/source/migration/v0.46.md (NEW — consumer migration guide)
docs/source/adr/0002-scorecard-as-primary-metric-surface.md (NEW)
CHANGELOG.md (new v0.45.0 + v0.46.0 entries since Round 5)
Source tree, with focus on the v0.45 + v0.46 additions:
  src/eval_toolkit/stacking.py (v0.45, ~370 lines)
  src/eval_toolkit/scorecards.py (renamed from _scorecard.py at v0.49.0; ~585 lines)
  src/eval_toolkit/metric_specs.py (v0.46, ~180 lines)
  src/eval_toolkit/__init__.py — __getattr__ deprecation branch + _DEPRECATED_SCALARS (Audit F4 invariant region)
  src/eval_toolkit/metrics.py:120-200 — SINGLE_CLASS_INCOMPATIBLE_METRICS extension (Round-5 X.2 precondition)
Audit prompt:
~/.claude/plans/gate3-audit-prompt.md
(local). The “Known issues already in the v0.48 backlog (skip re-reporting)”
section already lists drift items scheduled for v0.48 polish — Round 6
reviewers should skip those and surface only NEW findings against the v0.46
state.
Focus areas for Round 6 review:
scorecard surface design lock-in. The Tier-2 MetricSpec Protocol freezes at v1.0 — method-signature changes require a v2.0 major bump. Last cheap chance to catch contract gaps.
MetricResult cell-state contract — does the ok/skipped/error vocabulary cover every relevant failure mode? Are the reason strings useful for triage?
Per-cell error isolation — confirm that catching all exceptions in _evaluate_spec doesn’t hide important failures the user should see.
__getattr__ deprecation shim — Audit F4 invariant: does the branch correctly route deprecated names, NOT break non-deprecated resolution, and cleanly delete at v0.47?
Spec name encoding for parameterized metrics — is "ece_n_bins_15_strategy_uniform" a stable v1.0 commitment, or does the alphabetize-kwargs rule create surprise keys for custom user specs with multi-kwarg signatures?
Scorecard.to_pandas() MultiIndex schema — first-time-public; any shape lock-in concerns?
Triage on findings: each blocker → p1-gate3-labelled GitHub issue +
a row in this ledger. Either fix-as-v0.46.1-hotfix or fold into v0.47
design (per Decision Q severity-tiered hotfix policy).
| ID | Reviewer | Severity | Finding | Disposition | Lands |
|---|---|---|---|---|---|
| R6-F1 | Codex | BLOCKER before v0.47 opens |  | Add strategy validation in | RESOLVED v0.46.1 (commit |
| R6-F2 | Codex + Gemini | HIGH before v0.47 scalar hard-removal | ECE deprecation warnings in | Restructure | RESOLVED v0.46.1 (commit |
| R6-F3 | Codex | HIGH before scorecard freeze | Duplicate | Decision R6-B (locked): reject in | v0.47 |
| R6-F4 (= Gemini R6-F1) | Codex + Gemini | HIGH before v1.0 |  | Decision R6-A (locked): deterministic-by-default; fix docs only. No behavior change. Plan §4G-prep. (Decision R6-E: rolls to v0.47 — R6-A is non-blocker per Decision Q’s “docstring” category.) | v0.47 |
| R6-F5 | Codex | Contract-enforcement gap before v1.0 | ADR 0003 promises strict Tier-2 Protocol method-shape stability; current public-API drift guard only snapshots | Decision R6-D (locked): extend | v0.47 |
| R6-F6 | Codex | Packet drift | v1.0 plan + roadmap still describe pre-v0.46 scorecard shapes that didn’t ship: | Plan §4L: refresh plan §3A scorecard examples + roadmap shipped-state section. Doc-only commit on v0.47 release branch. | v0.47 |
| R6-F3 (Gemini) | Gemini | MEDIUM (schema lock-in before v1.0) |  | Decision R6-C (locked): add | v0.47 |
| R6-F4 (Gemini) | Gemini | LOW |  | Decision R6-H (locked): add | v0.47 |
| R6-F5 (Gemini) | Gemini | LOW |  | Narrow exception catch: | v0.47 |
Round 6 totals: 11 findings (Codex 6 + Gemini 5; 2 overlap on seed=None + ECE deprecation snippets but with different reasoning angles). 1 BLOCKER (R6-F1) + 5 HIGH + 2 MEDIUM/contract + 3 LOW. All dispositioned to either v0.46.1 (2 fixes) or v0.47 (9 fixes).
Key follow-on decisions (driven by Round 6 — locked in plan):
Decision R6-A: seed=None deterministic-by-default; fix docs only.
Decision R6-B: Reject duplicate MetricSpec.name with ValueError.
Decision R6-C: Add n_resamples + method to to_pandas() schema.
Decision R6-D: Extend public-API snapshot to cover Protocol method signatures.
Decision R6-E: v0.46.1 scope = R6-F1 + R6-F2 only; R6-A rolls to v0.47 (non-blocker per Decision Q’s “docstring” category).
Decision R6-F: Use n_bins=10 (pre-v0.46 default) in deprecation warnings + migration note about new v0.46+ factory default of n_bins=15. Corrects Gemini’s misverified pre-v0.46 default claim.
Decision R6-G: 3 ECE variants without metric_specs (debiased, l2, l2_debiased) route deprecation warnings to submodule path; do NOT add to metric_specs at v0.47.
Decision R6-H: make_spec_name() helper in metric_specs submodule only; not top-level.
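Decision R6-D (snapshotting Protocol method signatures) can be illustrated with a small sketch. It assumes the snapshot is a mapping from method name to signature string built with `inspect.signature`; the real drift guard may store something different. `ExampleSpec`, `ChangedSpec`, and `snapshot_methods` are hypothetical names used only for illustration.

```python
import inspect
from typing import Protocol


class ExampleSpec(Protocol):
    """Hypothetical stand-in for a Tier-2 Protocol; not the real MetricSpec."""

    name: str

    def compute(self, y_true: list, y_score: list) -> float: ...


class ChangedSpec(Protocol):
    """The same Protocol after an incompatible method-signature change."""

    name: str

    def compute(self, y_true: list, y_score: list, sample_weight: list) -> float: ...


def snapshot_methods(proto: type) -> dict:
    """Map each public method defined on the class to its signature string."""
    return {
        attr: str(inspect.signature(member))
        for attr, member in vars(proto).items()
        if not attr.startswith("_") and inspect.isfunction(member)
    }


baseline = snapshot_methods(ExampleSpec)
current = snapshot_methods(ChangedSpec)
# Any method whose signature string differs from the baseline is drift.
drifted = sorted(k for k in baseline if baseline[k] != current.get(k))
print(drifted)
```

A snapshot that records only exported names would treat these two Protocols as identical; comparing signature strings is one way to make a method-shape change fail the guard.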
Round 6 v0.46.1 ship status (2026-05-21)
R6-F1 ✅ SHIPPED in v0.46.1 (PR #67, squash 7a4bb14). End-to-end verified in consumer: ms.ece(strategy="typo") raises ValueError; direct _EceSpec(strategy=...) construction also raises (defence-in-depth).
R6-F2 ✅ SHIPPED in v0.46.1 (PR #67, squash 7a4bb14). End-to-end verified in consumer: eval_toolkit.expected_calibration_error warning carries ece(n_bins=10) + key ece_n_bins_10_strategy_uniform + migration note about v0.46+ n_bins=15 default. All 3 non-first-party variants (debiased, l2, l2_debiased) route to from eval_toolkit.metrics import … submodule path.
9 other Round 6 items (R6-A docstring, R6-B duplicate-name guard, R6-C to_pandas schema, R6-D Protocol method-shape snapshot, R6-F4-Gemini make_spec_name, R6-F5-Gemini narrow except, R6-F6 plan/roadmap state-drift) → folded into release/v0.47.0 per Decision R6-E.
Round 6 STOP-GATE status: CLOSED. release/v0.47.0 can open after the v0.46.1 consumer cycle observation completes (1 cycle).
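The deprecation routing verified above follows the general shape of a module-level `__getattr__` hook (PEP 562). The sketch below is a minimal stand-alone illustration of that mechanism, not the toolkit's actual shim: the module `example_pkg`, the `_DEPRECATED` table layout, and the hint text are assumptions made for the example.

```python
import types
import warnings

# Hypothetical module standing in for a package __init__; names are illustrative.
pkg = types.ModuleType("example_pkg")
pkg.pr_auc = lambda: "current"  # a non-deprecated name resolves normally

# Deprecated name -> (replacement callable, migration hint). Assumed layout.
_DEPRECATED = {
    "expected_calibration_error": (lambda: "legacy ece", "use ece(n_bins=10)"),
}


def _module_getattr(name: str):
    """PEP 562 hook: called only when normal attribute lookup fails."""
    if name in _DEPRECATED:
        target, hint = _DEPRECATED[name]
        warnings.warn(f"{name} is deprecated; {hint}", DeprecationWarning, stacklevel=2)
        return target
    raise AttributeError(f"module {pkg.__name__!r} has no attribute {name!r}")


pkg.__getattr__ = _module_getattr

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy = pkg.expected_calibration_error  # routed through the shim, warns once
    current = pkg.pr_auc  # found by normal lookup; the shim is never called
print(len(caught), legacy(), current())
```

Because the hook runs only after normal lookup fails, non-deprecated names are unaffected, and deleting the table entry plus the branch removes the shim cleanly, which is the property the Audit F4 invariant asks reviewers to confirm.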
Round 7 (complete: 2026-05-21 — Codex + Gemini; 3 substantive findings)
Reviewers: author (manual) + Codex (independent report) + Gemini (independent report).
Packet: v0.47.0 code state + the v1.0 plan + docs/source/methodology/
(16 chapters) + ADRs 0001/0002/0003 + docs/source/migration/v0.46.md +
docs/source/migration/v0.47.md + Round 5/6 ledger.
Round-7 briefing: gate3-audit-round-7.md (committed a9e1114).
Reports: gate3-audit-round-7-codex-report.md + gate3-audit-round-7-gemini-report.md (untracked per .gitignore).
Headline: Codex 3 substantive findings; Gemini 0. Overlap was zero between the two reports — the most consequential finding (R7-F1 doc-migration boundary gap between Sybil-tested fences and MyST-NB-executed example notebooks) was Codex-only. Reinforces the Round 6 pattern (do not use overlap as a confidence floor; single-reviewer findings can be the most critical).
| ID | Reviewer | Severity (their words) | Finding | Disposition | Lands |
|---|---|---|---|---|---|
| R7-F1 | Codex | high before v0.48 | v0.47 doc migration missed MyST-NB executable example notebooks (separate from Sybil-collected | Decision R7-A (locked at /exploring-options Q3): bundle into v0.48 §5G/§5H. §5G migrates the 6 notebooks + 4 docstrings + autosummary + roadmap; §5H enables | RESOLVED v0.48.0 (§5G commit |
| R7-F2 | Codex | high before sweep freezes |  | Decision R7-B option C (locked): emit | RESOLVED v0.48.0 (§5I commit |
| R7-F3 | Codex | worth fixing before v1.0 |  | Decision R7-C (locked): API-level | RESOLVED v0.48.0 (§5J commit |
Gemini observations (Round 7)
Gemini’s report verdict was “highly stable; release/v0.48.0 is safe to open.” Six minor observations / validations; nothing critical that Codex hadn’t covered. The actionable items folded into v0.48:
§1-3 + 5-7: VALIDATIONS of v0.47 shipped state (TextTransform shape, shim removal, sweep design, R6-D Protocol method-shape snapshot, ADR 0003 tiers). No action needed.
§4 (pedagogical drift): Gemini noted “from eval_toolkit.metrics import pr_auc” is syntactically green but slightly undermines ADR 0002. v0.48 §5G migration explicitly chose scorecard() for example notebooks teaching METRIC USAGE; submodule path only where teaching the underlying math.
§4 (Makefile pre-push): Gemini recommended hardening to prevent the pytest tests/ path-override trap. Landed as v0.48 §5L (make pre-push target running all 3 doc-execution surfaces; commit 9878a54).
§5 (R6-C dtype coercion): Gemini noted n_resamples (int + NaN) → float64 is an accepted tradeoff. Landed as v0.48 §5K (Notes section on Scorecard.to_pandas() docstring; commit 6304cea).
§6 (SynonymSubstitution whitelist): Gemini recommended adding a docstring note about the hardcoded 6-entry whitelist. Landed as v0.48 §5K (adversarial.py Notes section; commit 6304cea).
Audit-as-seed extensions (v0.48)
Per user direction during plan refinement (“use the audits as seeds for things to reconsider”), the Round 7 findings + style-invariants framing surfaced additional v0.48 scope beyond Codex’s explicit list:
§5G expansion: from 4 Codex-flagged module docstrings to full sweep across src/eval_toolkit/ module docstrings + audit of all docs/source/api/*.md autosummary pages. Found 8 missing API pages + 5 drifted autosummary lists.
§5M new: in-source docstring drift audit (third doc-execution surface). Result: 82 PASS / 1 skipped / 0 fail; expanded .doctest-modules from 11 → 21 modules so CI catches future drift.
§5N comprehensive: cross-API shape-validation consistency sweep beyond Codex’s R7-F3 target. Audited metrics_at_threshold, paired_bootstrap_op_point_diff, bootstrap_metric_from_predictions, metrics.py scalars, fit_*_binary / fit_*_calibrator. Tightening commit landed for metrics_at_threshold silent threshold semantics (commit 76773dc); paired_bootstrap_op_point_diff val_y is test_y guard landed as part of §5E-prep code-side fix (commit 5c8e68d).
Round 7 ship status
3 substantive Codex findings: all RESOLVED in v0.48.0 via §5G + §5H + §5I + §5J.
6 Gemini observations: all RESOLVED in v0.48.0 via §5G + §5K + §5L.
Audit-as-seed extensions (§5G expanded, §5M new, §5N comprehensive): all RESOLVED in v0.48.0.
Round 7 STOP-GATE status: CLOSED via v0.48.0 release. Round 8 audit STOP-GATE per Decision Y.2 opens against the v0.48.0 state before v1.0.0 tag can land.
Round 8 (2026-05-24) — STOP-GATE before v1.0 tag, rectified in v0.51.0
Multi-LLM cross-review fired against the v0.50.0 state. Codex returned
10 findings (3 with reproducible probes: C3, C4(a)/(b), C9); Gemini
returned 5 findings + 3 positive validations (V1 style, V2 methodology,
V3 scorer_error safety). Independent verification by Claude
(audit-verification-codex-gemini-v0.50.0.md, 510 lines, untracked
per .gitignore) confirmed 13 of 18 items, refuted 3 (R8-G2, R8-G5,
R8-V1+R8-V2 paired-as-one over-confident Gemini validations), and
deferred 2 (R8-G3, R8-G4) to v1.x as Tier-2 additive.
The v0.51.0 release ships fixes for all 13 confirmed items; 2 (G3, G4)
are deferred to v1.x as Tier-2 additive; 3 refuted entries are recorded
below for the audit trail.
Round 9 multi-LLM cross-review runs against the v0.51 RC before v1.0 tags.
Confirmed (13) — fixed in v0.51.0
| ID | Severity | Finding | Disposition | Commit |
|---|---|---|---|---|
| R8-C3 | blocker (escalated from Codex fix-rec) |  | v0.51 sentinel |  |
| R8-C4(a) | blocker (escalated) | Generator-rng not parallel-stable in | v0.51 spawns child SeedSequences at the dispatch boundary. |  |
| R8-C4(b) | blocker (escalated) |  | v0.51 draws fresh entropy via |  |
| R8-C1 | fix-recommended |  | v0.51 adds |  |
| R8-C2 | fix-recommended |  | v0.51 caps iter_folds at |  |
| R8-C5 | fix-recommended | README links to | v0.51 repoints all README links to |  |
| R8-C6 | fix-recommended |  | v0.51 Tier-2 additive validation matching |  |
| R8-C7 | fix-recommended | Tests covered counts where semantics matter: | v0.51 adds semantic-property regression tests bundled with C1/C2/C4 fixes. |  |
| R8-C8 | minor |  | v0.51 demotes SimilarityStrategy to “pre-v0.7 internal interface” in README + extending.md (aligns to v1.0 contract per ADR 0003). |  |
| R8-C9 | minor |  | v0.51 docstring documents the JSON-safety contract — strict-JSON requires |  |
| R8-C10 | minor |  | v0.51 |  |
| R8-F1 | fix-recommended (Claude — missed-by-both) |  | v0.51 eager validation matching sibling-kwarg pattern. |  |
| R8-F2 | minor (Claude — missed-by-both) |  | v0.51 eager |  |
| R8-F3 | minor (Claude — missed-by-both) |  | v0.51 detects missing columns at read time; raises actionable |  |
Refuted (3) — recorded for audit trail; NO fix shipped
| ID | Severity (as auditor marked it) | Finding (rejected) | Rationale |
|---|---|---|---|
| R8-G2 | fix-recommended (Gemini) | “Cyclic import leakage→harness identified but left unresolved.” | REFUTED: |
| R8-G5 | minor (Gemini) | “Plotting tests use weak assertions ( | REFUTED: |
| R8-V1 | “Style + consistency: Exceptional” (Gemini) | Validation that the axis is exceptional. | REFUTED: Gemini’s specific tooling claim (Ruff + Black + Mypy strict compliance) is true, but the axis is broader. Codex C1 + C5 + C6 + C7 + C8 + R8-G3 confirm substantive cross-API / docs / contract consistency drifts. Tooling-clean ≠ axis-exceptional. |
| R8-V2 | “Domain methodology: Masterclass” (Gemini) | Validation that methodology is masterclass. | REFUTED: Codex C1 + C3 + C4 + C6 confirm methodology bugs at fix-recommended-or-blocker severity. Methodology has real flaws; “Masterclass” overstates. Gemini’s evidence (research-grounded tests exist) conflates research-citation discipline with methodology correctness. |
| R8-V3 | “scorer_error catches predictions safely” (Gemini) | Validation that the mechanism is safe. | CONFIRMED, recorded for completeness: |
Deferred to v1.x (2)
| ID | Severity | Finding | Disposition |
|---|---|---|---|
| R8-G3 | fix-recommended (Gemini) | Codebase uses | Deferred to v1.x as Tier-2 additive. Custom exception hierarchy subclassing ValueError preserves catch-compatibility; adding it post-v1.0 is non-breaking. Reconsider when a downstream consumer requests it. |
| R8-G4 | minor (Gemini) | joblib OOM hazard documented at | Deferred to v1.x. Memory-aware capping needs RAM-measurement + dataframe-size accounting — non-trivial, no clear best-practice. Caller is responsible for |
Round 8 ship status
13 confirmed findings: all RESOLVED in v0.51.0 via the per-finding commits above (Phase 1 + 2 + 3 + 4 on release/v0.51.0).
3 refuted findings (R8-G2 cyclic-import framing, R8-G5 cherry-picked weak test, R8-V1+R8-V2 over-confident Gemini validations paired-as-one): recorded above; no shipped fix needed.
2 deferred findings (R8-G3 custom exceptions, R8-G4 joblib OOM capping): Tier-2 additive; v1.x or later.
Tally: 13 + 3 + 2 = 18 total (canonical; aligned with CHANGELOG.md v0.51 section + migration/v0.51.md per RC4 reconciliation).
Round 8 STOP-GATE status: CLOSED via v0.51.0 ship. Round 9 audit STOP-GATE per Decision Y.2 opens against the v0.51.0 RC before v1.0.0 tag can land.
Round 9 (2026-05-24) — STOP-GATE before v1.0 tag
Round 9 multi-LLM cross-review fired against the release/v0.51.0 RC
(7 commits at the time of audit; PR #75 draft, CI green). Codex
returned 4 substantive findings (RC1-RC4) + self-validation worklog
(69/69 v0.51 regression tests pass); explicitly skipped the 10 modules
neither Round-8 auditor cited + 4 cross-cutting hunts. Gemini returned
6 items (RG1-RG6): 3 design challenges (2 escalating already-locked v0.51
decisions to v1.0 blockers) + 3 positive validations; zero probes, zero
line citations, ~45% validation density. Independent verification by
Claude (audit-verification-round-9-v0.51.0.md, 228 lines; gitignored
per R8-C10) confirmed 6 / refuted 3 / partial 1 of the 10 source-report
items, AND surfaced 3 third-audit findings + 3 minors by hunting the
modules Codex skipped. Of those, F-sweep-1 was a CANDIDATE v1.0 BLOCKER
(R7-C “no silent failures” invariant violation on NaN scorer output)
that neither Round-8 nor Round-9 auditor caught.
Per the Path 3 lock from the post-R9 planning session, two third-audit
findings (F-sweep-1 + F-bootstrap-1) ship fixed in this PR; RC1
already fix-in-PR at 4c43771; the remaining items defer to v1.0.1.
Confirmed (6) — disposition mapped
| ID | Severity | Finding | Disposition | Commit / Defer-to |
|---|---|---|---|---|
| R9-RC1 | fix-recommended (Codex) | README.md:208 has 2 broken | Fixed in PR. |  |
| R9-RC2 | fix-recommended (Codex) |  | Deferred to v1.0.1. Two paths: add to strict-Tier-2 list (10 strict + 1 opt-in; ADR 0003 update) OR remove from | v1.0.1 |
| R9-RC3 | fix-recommended (Codex) |  | Deferred to v1.0.1 test hardening. | v1.0.1 |
| R9-RC4 | minor (Codex) | “3 refuted” counts categories (V1+V2 paired as one) while 4 named items listed; defensible across all 4 surfaces but invites confusion. | Deferred to v1.0.1 docs polish. | v1.0.1 |
| R9-RG4 | validation (Gemini) | metrics.py | Validation accurate; pattern at metrics.py:35-99 is well-designed. | — |
| R9-RG5 | validation (Gemini) | bootstrap.py defaults match “industry standards”. | Validation accurate: | — |
Refuted (3) — recorded for audit trail; NO fix shipped
| ID | Severity (as auditor marked it) | Finding (rejected) | Rationale |
|---|---|---|---|
| R9-RG2 | v1.0 blocker (Gemini) | “DeprecationWarning without sunset date is structurally wrong; v1.0 blocker.” | REFUTED: re-litigates Q3-locked v0.51 design. |
| R9-RG3 | v1.0 blocker (Gemini) | “iter_folds capping silently is wrong; should raise.” | REFUTED: re-litigates R8-C2 design lock. |
| R9-RG6 | validation (Gemini) | “Regression tests apply strict invariant assertions, not weak proxies.” | REFUTED — directly contradicted by RC3 which I independently confirmed. |
Partial (1)
| ID | Severity | Finding | Disposition |
|---|---|---|---|
| R9-RG1 | accept-with-rationale (Gemini) |  | PARTIAL: internally NO callers (grep across |
Third-audit findings (Claude — modules neither auditor cited) — 3 substantive + 3 minor
| ID | Severity | Finding | Disposition | Commit / Defer-to |
|---|---|---|---|---|
| R9-F-sweep-1 | blocker for v1.0 (Claude escalation) |  | Fixed in PR. NaN/inf check added to R7-C boundary; test bundled. | New commit on |
| R9-F-bootstrap-1 | fix-recommended | scipy’s BCa returns degenerate CIs ( | Fixed in PR. UserWarning added at the bootstrap_ci boundary when BCa degenerates; test bundled. | New commit on |
| R9-F-bootstrap-2 | minor |  | Bundled into the F-bootstrap-1 commit. | New commit on |
| R9-F-metrics-1 | minor (Claude) |  | Deferred to v1.0.1 docs polish. | v1.0.1 |
| R9-F-metrics-3 | fix-recommended (Claude) |  | Deferred to v1.0.1; either guard against all-same-score input OR explicitly document. | v1.0.1 |
| R9-F-metrics-4 | minor (Claude) |  | Deferred to v1.0.1 docs polish. | v1.0.1 |
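The F-sweep-1 disposition (a NaN/inf check on scorer output at the boundary) can be sketched as follows. This is an illustration of the guard pattern only: `checked_score` and both scorers are hypothetical, and raising `ValueError` is an assumption, since the ledger does not say whether the actual fix raises or records an error state.

```python
import math


def checked_score(scorer, y_true, y_score) -> float:
    """Call a scorer and refuse to pass a non-finite result downstream."""
    value = float(scorer(y_true, y_score))
    if not math.isfinite(value):
        raise ValueError(f"scorer returned non-finite value: {value!r}")
    return value


def mean_score(y_true, y_score):
    """Toy scorer that always yields a finite number for non-empty input."""
    return sum(y_score) / len(y_score)


def broken_scorer(y_true, y_score):
    """Toy scorer that silently returns NaN, e.g. from an undefined ratio."""
    return float("nan")


ok = checked_score(mean_score, [0, 1], [0.25, 0.75])
try:
    checked_score(broken_scorer, [0, 1], [0.25, 0.75])
    rejected = False
except ValueError:
    rejected = True
print(ok, rejected)
```

Without the `math.isfinite` check, the NaN would flow into later aggregation as if it were a valid score, which is the kind of silent failure the R7-C invariant is meant to rule out.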
Round 9 ship status
6 confirmed source-report findings: 1 fixed in PR (RC1 → 4c43771), 3 deferred to v1.0.1 (RC2, RC3, RC4), 2 validation-confirmations (RG4, RG5 — no action).
3 refuted source-report findings (RG2, RG3, RG6): recorded above for audit trail; no shipped fix.
1 partial (RG1): accept-with-rationale at v1.0.
2 third-audit fixes shipped in this PR: F-sweep-1 (CANDIDATE BLOCKER closed); F-bootstrap-1 + F-bootstrap-2 bundled (degeneracy warning + NaN guard).
4 third-audit items deferred to v1.0.1: F-metrics-1, F-metrics-3, F-metrics-4, RC2 (SimilarityStrategy contract resolution).
Round 9 STOP-GATE status: CLOSED with mitigations per the Path 3 lock. v1.0 can tag from this PR post-CI-green.
Multi-LLM audit-machinery calibration notes
Codex R9 quality: 4/4 findings hold; RC1 over-cited (3 sites named but only 1 had real broken links); no padding finding this round. Gap remains: explicit skip of the 10 untouched modules + 4 cross-cutting hunts. F-sweep-1 (the candidate blocker) lives in precisely that gap. Future-round briefings should make the untouched-modules + hunts mandatory.
Gemini R9 quality: 2/6 validations honest (RG4, RG5); 3/6 REFUTED (RG2, RG3, RG6). RG6 is the new V1/V2 pattern — uncited positive validation on tests Gemini didn’t read; directly contradicted by Codex’s RC3 + my verification. RG2 + RG3 escalate locked design without engaging with the recorded rationale.
Trust calibration: Codex’s probe-backed findings → high trust; Codex’s untouched-module gaps → fill with Claude third-audit. Gemini’s validations → unreliable; Gemini’s design dissent → “raises the question” not “v1.0 blocker.”
Round 10 (2026-05-25) — v1.0 pre-tag micro-audit
Scoped Codex + Gemini micro-audit on the edadddc R9 follow-on commit
only (full ~/Claude/audit-templates/audit-prompt.md template with a
scope-override preamble constraining attention to ~126 LOC src + ~110
LOC tests in _sweep.py + bootstrap.py). Dispatched per the Round
10 locking decision: targeted micro-audit, not full multi-LLM gate
re-run. Codex returned a 311-line report with 3 substantive
findings + per-axis verdicts + 4-test probe verification (all 4
passed). Gemini returned an 86-line report with 2 findings (1
self-recommended-status-quo + 1 fix-recommended on metrics.py
finiteness — direct contradiction with Codex’s “Completeness”
verdict). Claude verification re-read every cited line. Reports
archived locally at codex-microaudit-edadddc-report.md +
gemini-microaudit-edadddc-report.md (gitignored).
Confirmed (3) — disposition mapped
| ID | Severity | Source | Finding | Disposition | Commit |
|---|---|---|---|---|---|
| R10-F1 | fix-recommended | Codex |  | Fixed in PR (Codex Option C): extend Protocol docstring to document | R10 follow-on commit on |
| R10-F2 | fix-recommended | Codex |  | Fixed in PR (Codex Option A): mirror the production predicate — | R10 follow-on commit on |
| R10-F3 | minor-observation | Codex |  | Fixed in PR (Codex Option A + brief Option B context): update Raises text to “non-positive or non-finite width” + 4-line note explaining scipy BCa NaN-bound motivation. | R10 follow-on commit on |
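The "non-positive or non-finite width" wording in R10-F3, together with the UserWarning described for F-bootstrap-1, suggests a predicate of roughly the following shape. This is a sketch under stated assumptions: the production predicate is not shown in this ledger, and `is_degenerate` and `warn_if_degenerate` are hypothetical names.

```python
import math
import warnings


def is_degenerate(low: float, high: float) -> bool:
    """True when the interval width is non-positive or non-finite."""
    width = high - low
    return not math.isfinite(width) or width <= 0


def warn_if_degenerate(low: float, high: float) -> bool:
    """Emit a UserWarning for a degenerate interval; return whether it fired."""
    if is_degenerate(low, high):
        warnings.warn(
            f"bootstrap CI is degenerate: ({low}, {high})", UserWarning, stacklevel=2
        )
        return True
    return False


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    results = [
        warn_if_degenerate(0.80, 0.95),                  # healthy interval
        warn_if_degenerate(1.0, 1.0),                    # zero width at a ceiling
        warn_if_degenerate(float("nan"), float("nan")),  # NaN bounds
    ]
print(results, len(caught))
```

Checking the width rather than each bound separately covers both the collapsed-interval case and the NaN-bound case with one condition, because any NaN or infinite bound makes the subtraction non-finite.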
Partial (1) — accept-as-design
| ID | Severity | Source | Finding | Disposition |
|---|---|---|---|---|
| R10-RG1 | minor-observation | Gemini |  | Accept-as-design. Gemini’s own recommendation was Option B (status quo): “Deterministic, fast, and covers the primary ‘small n + ceiling metric’ failure mode perfectly.” Output-proxy approach avoids |
Refuted (1) — Pattern-1 calibration record
| ID | Severity (as Gemini marked it) | Source | Finding | Rationale |
|---|---|---|---|---|
| R10-RG2 | fix-recommended | Gemini | “ | REFUTED. |
Multi-LLM audit-machinery calibration (R10)
Codex R10 quality: 3/3 substantive findings hold; probe-backed; cited file:line on every observation; honest “What I didn’t look at” calibration. Same depth and rigor as R8 + R9.
Gemini R10 quality: 1/2 findings honest (F1, with self-recommendation = status quo — effectively an open-question framed as a finding); 1/2 REFUTED (F2, Pattern-1 violation). The prompt’s explicit Pattern-1 discipline section did NOT change the outcome. Pattern-1 is now confirmed across R8 V1+V2, R9 RG6, and R10 RG2 — four instances across three rounds. Conclusion: Gemini’s training-trace defaults toward positive validation without reading; explicit prompt-level discipline is necessary but not sufficient to correct it. Practical implication: future audits should treat Gemini validations as “raises the question” and require Claude (or Codex) read-back before accepting any positive validation.
Round 10 STOP-GATE status: CLOSED with mitigations — 3 Codex confirmed findings fixed in this RC; 1 Gemini accept-as-design; 1 Gemini refuted. v1.0 can tag from the R10 follow-on commit post-CI-green.
Round 11 (2026-05-26) — consumer-feedback-driven validator adoption
Not a multi-LLM gate-style audit. Round 11 is the first consumer-feedback-driven round: consumer-side audit findings → upstream validator shipped → consumer adopts. Multi-LLM cross-review (Rounds 5–10) was the v1.0 stability-contract activation discipline; Round 11 is the v1.x ongoing consumer-feedback loop. Future Round 12+ would resume multi-LLM dispatch only if v2.0 work or a major audit cycle re-opens. This is flagged so readers don’t confuse R11 with the heavyweight gate-style rounds.
Motivating findings (consumer audit origin)
V1.3.1 ADR-080 audit-fix (2026-05-22, consumer prompt-injection-detection-prototype): WRITEUP_NARRATIVE.md:38 said “TF-IDF + LR reaches 0.974 AUPRC” but canonical TF-IDF direct val AUPRC = 0.971 (0.974 is LoRA’s value). The existing audit_numbers.py validates VALUES against source data but not BINDINGS — both 0.971 AND 0.974 exist in the expected set; the bug is the wrong pairing.
V1.3.2 P1-2 Part 2 (2026-05-23, consumer same repo): docs/REPRODUCIBILITY.md:76 cited “(locked at Phase 0-07 via ADR-029)” — ADR-029 is the test-marker-strategy ADR, not the reproducibility tier-lock ADR (which is ADR-034). Off-by-5 citation; lychee resolves it (ADR-029 exists), audit_numbers.py validates values, but neither catches semantic mis-citation.
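The difference between validating values and validating bindings, as described in the first finding, can be shown with a minimal sketch. The two AUPRC numbers come from the finding itself; the `CANONICAL` table layout and both check functions are hypothetical and do not reflect how audit_numbers.py or the upstream validator is implemented.

```python
# Hypothetical canonical table: (detector, metric) -> value.
CANONICAL = {
    ("TF-IDF + LR", "AUPRC"): 0.971,
    ("LoRA", "AUPRC"): 0.974,
}

# A claim as it might be extracted from prose: detector, metric, cited value.
claim = ("TF-IDF + LR", "AUPRC", 0.974)


def value_check(claim) -> bool:
    """Value-only audit: passes if the number exists anywhere in the source data."""
    return claim[2] in set(CANONICAL.values())


def binding_check(claim) -> bool:
    """Binding audit: passes only if the number belongs to this (detector, metric)."""
    detector, metric, value = claim
    return CANONICAL.get((detector, metric)) == value


print(value_check(claim), binding_check(claim))
```

The mis-paired claim passes the value-only check because 0.974 is a legitimate number somewhere in the source data, and fails only once the check is keyed on which detector the number is attached to.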
Upstream response timeline
| Time (UTC) | Event |
|---|---|
| 2026-05-24 | Consumer files #71 (value_bindings), #72 (concept_drift), #73 (citation_alignment) at v1.3.3 |
| 2026-05-24 (later) | PR #74 opens with audit subpackage skeleton + |
| 2026-05-25 17:01Z | v1.0.0 tag (stability contract activates per ADR 0003) |
| 2026-05-25 17:08Z | v1.0 snapshot regen hotfix ( |
| 2026-05-25 22:51Z | PR #74 merged ( |
| 2026-05-25 23:48Z | v1.0.1 ships — |
| 2026-05-26 01:02Z | v1.0.2 ships — #76 cleanup batch (RC2 SimilarityStrategy promoted to 10th strict Tier-2; RC3 test row-content hardening; F-metrics docstring polish) |
| 2026-05-26 01:48Z | Consumer PR #10 merged at v1.3.7 — consumer-side |
| 2026-05-26 02:35Z | v1.0.3 ships — |
End-to-end cycle: V1.3.2 P1-2 origin (2026-05-23) → consumer
adoption (2026-05-26 01:48Z) = ~3 days. Upstream-side v1.0.0 → v1.0.3
(audit-validator family of 2): ~10h compressed cycle. Consumer’s
v1.3.7 PR opened against audit_citation_alignment (the v1.0.1
deliverable); subsequent v1.0.3 audit_value_bindings adoption will
happen on consumer’s own cadence.
Audit-validator family status
| Validator | Module | Released | Issue | Status |
|---|---|---|---|---|
| Citation alignment |  | v1.0.1 ( | #73 | ✅ shipped + consumer-adopted (consumer PR #10 v1.3.7) |
| Value bindings |  | v1.0.3 ( | #71 | ✅ shipped; consumer adoption pending |
| Concept drift (embedding-clustering) |  | future | #72 | open; requires MiniLM (already vendored for leakage) + clustering design |
Pre-tag + post-v1.0.3 dogfood summary
The validators were dogfooded against eval-toolkit’s own markdown surface to prove they work in production before consumer adoption:
v1.0.1 pre-tag dogfood (validate_citations): 95 files (README + docs/source/**/*.md ex-ADRs ex-Sphinx-generated); 0 misalignments. Helper _build_adr_subjects_from_repo() exercised.
Post-v1.0.3 dogfood (re-run + synthetic validate_reader_value_bindings):
  Part 1 (citation_alignment expanded): 96 files; 0 misalignments (sanity preserved after audit_value_bindings.md + R10 + R11 additions to docs/source).
  Part 2 (value_bindings synthetic): bindings derived from Round 8 RC4-reconciled tally (("Round 8", "confirmed_count") -> 13 etc). Validator processed audit_findings.md + migration/v0.51.md + CHANGELOG.md without crashing; surfaced 216 pattern-matches + 82 matches. The high false-positive count is expected — the synthetic bindings dict has very loose detector (“Round 8”) + metric (numeric-count) patterns, and any number near a “Round 8” mention triggers. NOT real bugs in eval-toolkit’s docs — just the validator successfully exercising the regex+window pipeline at production-scale markdown.
The dogfood script is preserved at .scratch/dogfood_v1_0_x.py
(gitignored) for future re-runs.
Round 11 outcome
✅ Action 1 (consumer adoption): consumer PR #10 merged 2026-05-26 01:48Z. Consumer-side dogfood + adoption complete for validate_citations.
✅ Action 2 (this ledger entry): R11 section recorded.
✅ Action 3 (dogfood expansion): validators exercised on eval-toolkit’s own surface; no upstream-side findings.
GH issue #77 closed with state_reason=completed.
Multi-LLM audit cadence after R11
The Gate 3 audit chain (R5–R10) closed at v1.0 per ADR 0003. R11+ is the post-v1.0 consumer-feedback loop and uses NO multi-LLM cross-review by default. Multi-LLM dispatch resumes only if:
A v2.0 design cycle opens (would require a Gate 3-equivalent re-audit per ADR 0003 §”v2.0 re-audit triggers”).
A consumer files a major-severity finding that single-LLM verification (Claude only) leaves uncertain, in which case Codex + Gemini dispatch via the audit-prompt.md template re-engages for that specific question.
Otherwise R11+ is light-touch documentation of the consumer-feedback → upstream-validator → consumer-adopt cycle.
Round 12 (2026-05-26) — schema-extensibility + scope-correctness lesson from #80#
Not a multi-LLM gate-style audit. Round 12 is the second
consumer-feedback-driven round, immediately following R11. The R11
audit-validator family (3/3 shipped) revealed a deeper structural
gap in audit_value_bindings that the v1.0.3 implementation could
not address: the 2-tuple (detector, metric) canonical-identity
schema collapsed across slices, producing ~95 false positives on
the consumer’s writeup. The fix (v1.1.0) is bigger than the issue’s
literal ask — it surfaces two architectural rules that now govern
all audit validators (pending ADR 0005).
Motivating finding (consumer audit origin)#
Consumer’s v1.3.9 run (2026-05-26, prompt-injection-detection-submission):
scripts/audit_value_bindings.py against own writeup produced 96 warnings, ~95 false positives. The one real bug (WRITEUP_PAPER.md:545 mis-citing 0.971 inside a misleading paragraph context) was correctly flagged but hid in the noise. Consumer caught it via parallel manual audit, not via this validator. Filed #80 at 2026-05-26T18:15:45Z proposing a 3-tuple (detector, metric, slice) schema extension.
Upstream response timeline#
Time (UTC) |
Event |
|---|---|
2026-05-26T18:15Z |
Consumer files #80 (BINDINGS slice-axis schema) |
2026-05-26T18:30Z |
|
2026-05-26T19:00Z |
|
2026-05-26T19:24Z |
v1.0.5 ships — infrastructure-only release (publish workflow hardening; dress rehearsal for the new Verify PyPI receipt step). |
2026-05-26T19:30Z |
|
2026-05-26T19:45Z |
|
2026-05-26T20:00Z |
First dogfood of v1.1.0 slice-axis fix: only 22% noise reduction (95 → 74). User pushed back: “what is the right long-term solution?” Triggered the deeper rethink. |
2026-05-26T20:10Z |
Identified the second architectural layer: scope correctness. ~80% of v1.0.5 residual noise was content-type confusion (CI brackets, table cells, code blocks), not slice-axis confusion. Added |
2026-05-26T20:11Z |
v1.1.0 ships — |
End-to-end cycle: #80 filing (18:15Z) → v1.1.0 ship (20:11Z) =
~2 hours compressed cycle. Four /exploring-options rounds +
one user-driven architectural redirect (“ultrathink on what the
right long-term solution is”) produced a coherent two-layer fix
instead of the issue’s narrower literal ask.
The two-layer correctness lesson (pending ADR 0005)#
Audit validators have two correctness layers, not one:
Identity correctness: canonical measurements have structured identity (a frozen dataclass with named fields), not positional tuples. Future identity axes (split, ci_kind, source_ref, …) are added as defaulted fields without breaking the dict-key schema. Positional tuples lock the validator into a schema-change event that recurs every N months.
Scope correctness: the validator should only scan content that is plausibly a binding claim. Narrative prose sentences are. Markdown table cells aren’t (they’re structured data, audited differently). Bracketed expressions ([CI 0.286, 0.301]) aren’t (inline expressions, not point estimates). Code blocks aren’t.
This mirrors lint-design conventions from ruff/mypy/bandit
(scope predicates like # noqa/# nosec/# type: ignore are
first-class, not optional).
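As a rough illustration of the scope layer, the sketch below collects exclusion ranges for fenced code blocks, table rows, and bracketed expressions, then keeps only the decimal values that fall outside them. The helper names, the regexes, and the sample prose are assumptions made for this example; they are not the validator's actual implementation.

```python
import re

FENCE = "`" * 3  # built at runtime so this example can sit inside a fenced block


def excluded_ranges(markdown: str) -> list[tuple[int, int]]:
    """Character spans that are not narrative prose (hypothetical helper)."""
    patterns = [
        # Fenced code block: opening fence line through closing fence line.
        re.compile(rf"^{FENCE}.*?^{FENCE}[^\n]*$", re.M | re.S),
        # Markdown table row: a line that starts with a pipe.
        re.compile(r"^\|.*$", re.M),
        # Bracketed expression such as [CI 0.286, 0.301].
        re.compile(r"\[[^\]\n]*\]"),
    ]
    return [m.span() for p in patterns for m in p.finditer(markdown)]


def narrative_values(markdown: str) -> list[str]:
    """Decimal values that survive the scope filter."""
    ranges = excluded_ranges(markdown)
    return [
        m.group()
        for m in re.finditer(r"\d+\.\d+", markdown)
        if not any(start <= m.start() < end for start, end in ranges)
    ]


# Made-up sample document: one prose claim, one table row,
# one bracketed interval, one fenced code block.
doc = "\n".join([
    "LoRA reaches AUPRC 0.556 on the holdout slice.",
    "| LoRA | 0.519 |",
    "The interval is [CI 0.286, 0.301].",
    FENCE,
    "x = 0.123",
    FENCE,
])
print(narrative_values(doc))
```

Only the value in the prose sentence remains a candidate; the table cell, the interval bounds, and the code literal are never offered to the pairing step.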
Dogfood evidence#
Configuration |
Warnings on consumer HEAD |
Reduction vs v1.0.5 baseline |
|---|---|---|
v1.0.5 (legacy 2-tuple) |
95 |
— (baseline) |
v1.1.0 BindingKey + slice-fix only (scope=’all’, default |
74 |
-22% |
v1.1.0 BindingKey + slice-fix + |
23 |
-76% |
The first instinct — “just add a slice axis like the issue asks” — would have left the validator practically unusable on dense academic prose, requiring a separate v1.2.0 cleanup release. The deeper rethink (triggered by the user’s redirect) delivered both layers in a single coherent v1.1.0.
Residual 23 warnings are pre-existing positional-heuristic limitations (sentence-boundary unawareness, multi-detector list parsing on dense prose) not addressable without parser-level work; deferred to future v1.2.0+ as a narrower issue class.
Round 12 outcome#
✅ v1.1.0 shipped (commit 911565b, PyPI live).
✅ #80 auto-closed by closes #80 directive on the v1.1.0 tag.
✅ 0 open issues on the eval-toolkit repo post-#80.
✅ Architecture lesson codified in pending ADR 0005: the identity + scope two-layer rule applies to future audit validators.
✅ Memory entry: feedback_validator_identity_plus_scope captures the “issue body says X is the problem; investigate the dominant noise source before agreeing” discipline.
Consumer adoption + HARD-gate viability#
Consumer (prompt-injection-detection-submission) currently pins
eval-toolkit>=1.0.3,<2. Adoption path for v1.1.0:
Smallest diff: replace 2-tuple BINDINGS literal with 3-tuple keys (issue body’s proposal; works directly).
Recommended: migrate to BindingKey(detector=..., metric=..., slice=...) for forward-extensibility.
Add scope="narrative" to the validator call.
Re-run; expect ~76% noise reduction.
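The key migration in the first two steps can be sketched as follows. The dataclass here is a stand-in written for this example: only the field names (detector, metric, slice) come from the text, while the default slice value, the upgrade_bindings helper, and the sample values are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BindingKey:
    """Stand-in for the library's BindingKey; only the field names come from the text."""
    detector: str
    metric: str
    slice: str = "all"  # assumed default; the real default is not stated here


def upgrade_bindings(bindings: dict) -> dict:
    """Convert legacy 2-tuple / 3-tuple keys into BindingKey instances."""
    upgraded = {}
    for key, value in bindings.items():
        if not isinstance(key, BindingKey):
            key = BindingKey(*key)  # positional tuple -> named fields
        upgraded[key] = value
    return upgraded


# Illustrative values only.
legacy = {
    ("LoRA", "auprc"): 0.556,             # 2-tuple: slice falls back to the default
    ("LoRA", "auprc", "holdout"): 0.519,  # 3-tuple: slice is explicit
}
bindings = upgrade_bindings(legacy)
print(bindings[BindingKey(detector="LoRA", metric="auprc", slice="holdout")])
```

Because the dataclass is frozen it is hashable and works as a dict key, and a later identity axis can be added as another defaulted field without invalidating existing keys.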
HARD-gate promotion at consumer’s v1.3.10+ becomes credible at the ~80% reduction level. Remaining residual false positives can be suppressed via consumer-side filtering (excluding lines containing “random floor” or “versus”) or accepted as known low-frequency noise pending v1.2.0+ pairing-rule improvements.
Multi-LLM audit cadence after R12#
Same as R11: no multi-LLM cross-review for consumer-feedback
rounds; reserved for v2.0 design cycles or major-severity findings
that single-LLM verification cannot resolve. The R12 fix was
informed by the user-driven /exploring-options discipline
(4 rounds + a structural redirect), which substitutes for
multi-LLM dispatch for this class of architectural decision.
Round 13 (2026-05-26) — audit_value_bindings v1.2.0 context-aware noise reduction#
Not a multi-LLM gate-style audit. Third consumer-feedback-driven
round, immediately following R12. R12 closed the slice-axis
schema-extensibility gap (BindingKey + scope='narrative' content-type
filter) and reduced the consumer’s noise floor from 96 → 36 warnings
(62%). R13 addresses the residual 36 — the positional-heuristic
limitations ADR 0005 named as “Future work (deferred)” — via four
context-aware extensions to scope='narrative' shipped as v1.2.0.
Motivating evidence (consumer-side, post-v1.3.11)#
Consumer (prompt-injection-detection-submission@v1.3.11) ran the
v1.1.0 audit_value_bindings against their writeup and produced 36
residual false positives. Consumer-side categorization in
decisions/upstream_issues.md:92:
| Category | Count | Discriminating prose pattern |
|---|---|---|
| Random-floor / sub-clause mentions | ~10 | “random floor of 0.374”, “baseline of 0.374” |
| Delta with sign / keyword | ~9 | “-0.071 AUPRC”, “drops -0.132 below” |
| Multi-detector “vs”/“versus” enumeration | ~8 | “AUPRC 0.556 vs 0.519” (same binding double-flagged) |
| Sentence-boundary cross-pairing | ~5 | “X scored 0.291. The floor is 0.374” |
| Genuinely ambiguous (cross-detector list constructions) | ~4 | “0.293 versus 0.364 for the frozen probe and 0.291 for TF-IDF” |
The first four categories share a structural property: they’re narrative-prose context cues that the v1.1.0 positional heuristic can’t read. ADR 0005’s “Future work (deferred)” section named them explicitly. The fifth category (cross-detector list grammar) is a deeper parser-level problem.
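The sentence-boundary category is the easiest of these to picture in code. The sketch below bounds value pairing to the sentence that contains the detector mention; the splitting regex and helper names are assumptions for illustration, and the sample prose substitutes a concrete detector name for the “X” in the consumer's example.

```python
import re


def sentence_spans(text: str) -> list[tuple[int, int]]:
    """Split on ., ! or ? followed by whitespace; decimals like 0.291 stay intact."""
    spans, start = [], 0
    for m in re.finditer(r"(?<=[.!?])\s+", text):
        spans.append((start, m.start()))
        start = m.end()
    spans.append((start, len(text)))
    return spans


def values_in_sentence_of(text: str, detector: str) -> list[str]:
    """Values sharing a sentence with the detector mention (hypothetical helper)."""
    pos = text.find(detector)
    if pos < 0:
        return []
    for start, end in sentence_spans(text):
        if start <= pos < end:
            return re.findall(r"\d+\.\d+", text[start:end])
    return []


prose = "TF-IDF scored 0.291. The floor is 0.374"
print(values_in_sentence_of(prose, "TF-IDF"))
```

A purely positional window would also offer 0.374 as a candidate for the detector; bounding by sentence leaves only 0.291. This naive splitter would still break on abbreviations such as “e.g. ”, which a real implementation would need to handle.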
Upstream design — /exploring-options 4 rounds + ultrathink redirect#
Time (UTC) |
Event |
|---|---|
~20:30Z |
|
~20:45Z |
|
~21:00Z |
|
~21:10Z |
|
~21:20Z |
v1.2.0 ships — commit |
Dogfood evidence (compounded across the cycle)#
Release |
Configuration |
Warnings on consumer HEAD |
Reduction vs v1.0.5 |
|---|---|---|---|
v1.0.5 |
Legacy 2-tuple, no scope filter |
95 |
— (baseline) |
v1.1.0 |
|
23 |
-76% |
v1.2.0 |
+ T1–T4 context filters |
7 |
-93% |
The 7 v1.2.0 residuals are all in the “cross-detector list construction” category — prose where the validator can’t infer that subsequent values belong to OTHER detectors via list connectives (“and”, “for X”, “vs”). T3 only deduplicates the SAME binding within one sentence; cross-detector list inference is a parser-level problem.
Tactical refinements during implementation#
Three calibration adjustments emerged from running the test suite + dogfood:
_FLOOR_KEYWORDS narrowed: initial draft included “baseline”, “prior”, “majority”; removed after the existing test_detector_with_no_nearby_value_skipped failed on prose "The TF-IDF baseline performs well" (where “baseline” was a detector descriptor, not a floor reference). Replacement coverage for the legitimate floor cases is provided by T1’s “below” keyword (e.g., “below the prevalence baseline of 0.374” is caught by “below”).
T1 keyword window changed to before-only: initial draft used symmetric ±30 chars; the combined-dogfood test failed on "frozen probe's 0.515 (delta -0.132)" because “delta” (AFTER 0.515) suppressed 0.515. Before-only window correctly fires T1 on -0.132 (preceded by “delta”) and leaves 0.515 alone.
_DELTA_KEYWORDS calibrated: dropped “against” (suppressed legitimate “LoRA’s AUROC is 0.383 against frozen probe’s 0.515” matches). Kept “vs”/“versus” (canonical delta separator). Excluded “above”/“ahead”/“behind” (too ambiguous).
These calibrations are captured in the keyword-list comments in
audit_value_bindings.py so future maintainers see the rationale.
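The window calibration can be reproduced with a small sketch. The keyword tuple below is an illustrative subset rather than the shipped _DELTA_KEYWORDS list, and the suppressed function is a hypothetical helper; only the 30-character window and the sample prose come from the text above.

```python
import re

# Illustrative subset only; not the shipped keyword list.
DELTA_KEYWORDS = ("delta", "drops", "below", "vs", "versus")
NUMBER = re.compile(r"-?\d+\.\d+")


def suppressed(text: str, window: int = 30, symmetric: bool = False) -> list[str]:
    """Return the numeric tokens a delta-keyword filter would suppress."""
    hits = []
    for m in NUMBER.finditer(text):
        before = text[max(0, m.start() - window):m.start()].lower()
        # The symmetric variant also inspects the text after the value.
        after = text[m.end():m.end() + window].lower() if symmetric else ""
        context = before + " " + after
        if any(re.search(rf"\b{re.escape(k)}\b", context) for k in DELTA_KEYWORDS):
            hits.append(m.group())
    return hits


prose = "frozen probe's 0.515 (delta -0.132)"
print(suppressed(prose, symmetric=True))   # symmetric window
print(suppressed(prose, symmetric=False))  # before-only window
```

With the symmetric window the keyword that follows 0.515 suppresses it as well; with the before-only window only -0.132 is suppressed, matching the behaviour described in the second calibration.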
Round 13 outcome#
✅ v1.2.0 shipped (commit 40b3741, PyPI live).
✅ ADR 0005 amended: “Future work (deferred)” section notes the v1.2.0 partial closure; sentence-boundary + 3 related context-aware filters resolved; cross-detector list grammar remains v1.3.0+ territory.
✅ Consumer adoption path clear: re-pin eval-toolkit>=1.2.0,<2; HARD-gate promotion of audit_value_bindings becomes credible (7 residual < 10 actionable threshold).
✅ Memory entry updated: feedback_validator_identity_plus_scope extended with the v1.2.0 sequel. The same “identity + scope” framing accommodates the new context-aware filters as scope extensions, vindicating the original two-layer architecture decision.
Multi-LLM audit cadence after R13#
Unchanged from R11/R12: no multi-LLM cross-review for
consumer-feedback rounds. The R13 design was informed by the
user-driven /exploring-options discipline (4 rounds + tactical
calibration during implementation), substituting for multi-LLM
dispatch for this class of incremental refinement. Future Round 14+
would re-engage multi-LLM dispatch only if a v2.0 design cycle
opens or a major-severity consumer finding requires it.
Cross-references#
ADR 0005 — “Future work (deferred)” section now reflects v1.2.0 closure status.
v1.2.0 CHANGELOG entry — full per-filter detail (T1–T4 keyword lists, window calibration, dogfood numbers).
tests/test_audit_value_bindings.py: 36 tests (28 from v1.1.0 + 8 new for T1–T4 + sentence-boundary unit test).
Round 14 (2026-05-26) — Layer 3 pairing rules close cross-detector list-grammar (#81)#
Not a multi-LLM gate-style audit. Fourth consumer-feedback-driven
round in the post-v1.0 cycle (R11 → R12 → R13 → R14). The v1.2.0 cycle
(Round 13) reduced consumer noise from 96 → 4 warnings via the T1–T4
context-aware narrative filters; the residual 4 were all cross-detector
list-grammar / metric-axis confusion cases that the two-layer
identity-+-scope model couldn’t resolve. Consumer filed
#81
documenting the 3 prose patterns + proposing a _LIST_CONNECTIVES
extension; upstream v1.3.0 ships Layer 3 (pairing rules) per
ADR 0006
as the architectural closure.
Motivating residuals from v1.2.0 dogfood#
Consumer’s v1.3.12 dogfood (eval-toolkit v1.2.0 adopted; 36 warnings → narrowed by their additional SKIP_PATTERNS to 4):
File:Line |
Pattern category |
Prose snippet |
|---|---|---|
|
“for X” postfix (Pattern A) |
|
|
possessive + metric confusion (Patterns B + D) |
|
|
group subject (Pattern C) |
|
Upstream design — /exploring-options 2 rounds + Pattern D emerged during dogfood#
Time (UTC) |
Event |
|---|---|
~21:30Z |
Consumer files #81 documenting the 3 prose patterns (A/B/C) + proposing |
~22:00Z |
|
~22:10Z |
|
~22:30Z |
Implementation reaches first dogfood: 4 → 2 warnings. Pattern D emerges: the residual 2 are metric-axis confusion ( |
~22:45Z |
Three tactical calibrations: (a) Pattern A intervening-value check now uses v1.1.0’s |
~22:55Z |
Override-bypass architecture: Pattern A/B now set |
~23:00Z |
v1.3.0 ships — commit |
End-to-end from #81 filing to v1.3.0 ship: ~1.5 hours.
Layer 3 architecture (per ADR 0006)#
Audit validators now have a three-layer correctness model:
Layer |
Dimension |
Mechanism |
Release |
|---|---|---|---|
1 |
Identity |
|
v1.1.0 |
2 |
Scope |
Content-type + context-keyword filters ( |
v1.1.0 + v1.2.0 |
3 |
Pairing |
Override / suppress proximity-based pairing under grammar cues |
v1.3.0 |
Four Layer 3 rules ship in v1.3.0 (all under scope='narrative', no new public kwargs):
Pattern A: "for {detector}" postfix → override detector pairing (with intervening-value guard via v1.1.0 exclusion-ranges).
Pattern B: "{detector}'s" possessive → override detector pairing (last possessive within 30 chars of value).
Pattern C: "for the {trained|frozen|baseline|all|both|other} detectors" group subject → suppress candidate (with sentence-boundary guard via v1.2.0 sentence-positions).
Pattern D: metric-axis nearest-pairing (symmetric to detector-axis). Pre-collects ALL metric positions across consumer-supplied metric_aliases keys, not just binding-derived metrics.
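Patterns A and B can be approximated with two regular expressions. This is a minimal sketch under stated assumptions: the detector list, the regexes, and the override_pairs helper are invented for illustration, and the intervening-value guard is reduced to requiring the value to sit directly before “for”.

```python
import re

DETECTORS = ("frozen probe", "TF-IDF", "LoRA")  # illustrative detector names
_ALTS = "|".join(re.escape(d) for d in DETECTORS)
_VALUE = r"(?P<value>-?\d+\.\d+)"

# Pattern A: value immediately followed by "for [the] {detector}".
POSTFIX = re.compile(rf"{_VALUE}\s+for\s+(?:the\s+)?(?P<det>{_ALTS})")
# Pattern B: "{detector}'s" with the value at most 30 non-numeric chars later.
POSSESSIVE = re.compile(rf"(?P<det>{_ALTS})['’]s\b[^.\d]{{0,30}}?{_VALUE}")


def override_pairs(text: str) -> list[tuple[str, str]]:
    """(value, detector) pairs whose pairing a grammar cue overrides, in text order."""
    found = []
    for pattern in (POSTFIX, POSSESSIVE):
        for m in pattern.finditer(text):
            found.append((m.start("value"), m.group("value"), m.group("det")))
    return [(value, det) for _, value, det in sorted(found)]


print(override_pairs("0.293 versus 0.364 for the frozen probe and 0.291 for TF-IDF"))
print(override_pairs("LoRA's AUROC is 0.383 against frozen probe's 0.515"))
```

Values that match neither cue (0.293 in the first sentence) are left to the default proximity pairing.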
Dogfood evidence (compounded across the four rounds)#
| Release | Configuration | Warnings on consumer HEAD | Reduction vs v1.0.5 |
|---|---|---|---|
| v1.0.5 | Legacy 2-tuple, no scope filter | 95 | (baseline) |
| v1.1.0 | BindingKey + scope='narrative' content-type | 23 | -76% |
| v1.2.0 | + T1–T4 context filters | 7 | -93% |
| v1.3.0 | + Patterns A/B/C/D pairing rules | 0 | -100% |
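The reduction column follows directly from the warning counts. A short check, using only the counts from the table above (the function name is arbitrary):

```python
def reduction_pct(baseline: int, current: int) -> int:
    """Signed percentage change relative to the baseline, rounded to an integer."""
    return round(100 * (current - baseline) / baseline)


warnings_by_release = {"v1.0.5": 95, "v1.1.0": 23, "v1.2.0": 7, "v1.3.0": 0}
baseline = warnings_by_release["v1.0.5"]
for release, count in warnings_by_release.items():
    print(release, count, f"{reduction_pct(baseline, count)}%")
```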
Round 14 outcome#
✅ v1.3.0 shipped (commit 4c2742e, PyPI live).
✅ #81 auto-closed by closes #81 directive on the v1.3.0 tag.
✅ 0 open issues on the eval-toolkit repo post-#81.
✅ ADR 0006 committed: Layer 3 pairing rules formally codified as the third correctness layer.
✅ Consumer HARD-gate promotion now credible. With 0 residual warnings, the consumer can promote audit_value_bindings from SOFT to HARD bundled with audit_citation_alignment per their v1.3.8 plan.
The post-v1.0 consumer-feedback cycle (R11 → R14)#
The four post-v1.0 rounds form a complete consumer-feedback loop:
Round |
Driver |
Cycle time |
Closure |
|---|---|---|---|
R11 |
Consumer adopts v1.0.x audit-validator family |
days |
v1.0.4 (3 validators shipped) |
R12 |
Consumer files #80 (BINDINGS slice-axis) |
~2 hours |
v1.1.0 + v1.2.0 ( |
R13 |
v1.1.0 dogfood surfaces context-filter gaps |
~1 hour |
v1.2.0 (T1–T4 context filters; 23 → 7) |
R14 |
Consumer files #81 (cross-detector list-grammar) |
~1.5 hours |
v1.3.0 (Layer 3 pairing rules; 4 → 0) |
The pattern is consistent: consumer surfaces a structural gap → upstream files an ADR-driven design → ships within hours → consumer adopts same-day. The 100% noise reduction over four rounds validates the iterative consumer-feedback model as a substitute for heavyweight multi-LLM gate audits (which closed at v1.0 per ADR 0003).
Multi-LLM audit cadence after R14#
Unchanged from R11/R12/R13: no multi-LLM cross-review for
consumer-feedback rounds. R14’s design was informed by
/exploring-options (2 rounds) + Pattern D emergence during dogfood + 3 tactical calibrations during implementation. Multi-LLM dispatch remains reserved for v2.0 design cycles or major-severity findings.
Cross-references#
ADR 0006 — Layer 3 codification.
v1.3.0 CHANGELOG entry — full per-pattern detail with consumer adoption path.
tests/test_audit_value_bindings.py: 43 tests (36 from v1.2.0 + 7 for Patterns A/B/C/D + backward-compat + combined dogfood).
Round 15 (2026-05-26) — audit_citation_alignment Layer 2 + Layer 3 + ADR 0007 family-wide architecture#
Not a multi-LLM gate-style audit. Fifth consumer-feedback-driven
round, opening the equivalent cycle for audit_citation_alignment
that R11→R14 closed for audit_value_bindings. Consumer filed
eval-toolkit#82
concurrent with their v1.3.13 adoption: 188 residual warnings on
audit_citation_alignment, same architectural-class gap (Layer 2 +
Layer 3 context-awareness) that audit_value_bindings worked through
across three releases.
Architectural insight: ADR 0007 generalizes the model#
ADR 0005 (Layer 1 + 2) and ADR 0006 (Layer 3) were originally
framed as audit_value_bindings-specific. The v1.4.0 cycle adopting
the same architecture for audit_citation_alignment revealed that
the three-layer model is the canonical architecture for the
audit-validator family — not just one validator. ADR 0007 codifies
this family-wide.
The v1.4.0 release also extracted shared narrative-prose helpers
into private flat module eval_toolkit/_narrative.py (consistent
with ADR 0001’s _rng.py/_parallel.py/_sweep.py precedent).
Both validators import from this module; future audit validators
inherit the same primitives.
Three-layer model applied to audit_citation_alignment#
Pattern |
Layer |
Mechanism |
|---|---|---|
β |
Layer 2 |
Exclude citations inside markdown table rows, bracketed expressions, fenced code blocks (reuses v1.1.0 |
γ |
Layer 3 |
Category-keyword extraction window bounded by the SENTENCE containing the citation (reuses v1.2.0 |
α |
Layer 3 |
Multi-category set membership: when the sentence matches multiple category keywords (multi-topic prose), accept the citation if the ADR’s actual category is in the set. |
None-skip |
Layer 2 refinement |
Defer (skip the citation) when |
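Pattern α's set-membership check, applied within one sentence as Pattern γ requires, might look like the sketch below. Everything here is hypothetical: the keyword map, the ADR category map, and the misaligned helper are invented for illustration, with ADR-025 labelled as a cost ADR as in the residual example later in this round. ADRs missing from the map are simply ignored, which is a simplification of this sketch and not a statement of the shipped None-skip rule.

```python
import re

# Hypothetical consumer-side maps.
CATEGORY_KEYWORDS = {
    "threshold": ("threshold", "operating point"),
    "cost": ("cost", "budget"),
}
ADR_CATEGORY = {"ADR-025": "cost"}


def misaligned(sentence: str) -> list[str]:
    """ADR citations whose category is not among the sentence's matched categories."""
    lowered = sentence.lower()
    matched = {
        category
        for category, keywords in CATEGORY_KEYWORDS.items()
        if any(keyword in lowered for keyword in keywords)
    }
    flagged = []
    for adr in re.findall(r"ADR-\d+", sentence):
        actual = ADR_CATEGORY.get(adr)
        if actual is None or not matched:
            continue  # sketch-only simplification: nothing to compare against
        if actual not in matched:
            flagged.append(adr)
    return flagged


print(misaligned("The threshold is fixed per ADR-025."))
print(misaligned("The threshold trades off against inference cost (ADR-025)."))
```

The single-topic sentence is flagged because the only matched category is “threshold”; the multi-topic sentence is accepted because “cost” is in the matched set.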
Upstream design — /exploring-options 3 rounds + Pattern α generalization + None-skip emerged during dogfood#
Time (UTC) |
Event |
|---|---|
~21:30Z |
Consumer files #82 documenting 3 prose patterns (α/β/γ) + 188-warning baseline. |
~22:00Z |
|
~22:10Z |
|
~22:20Z |
|
~22:30Z |
Implementation reaches first dogfood: 188 → 101 (46% reduction). Pattern β + γ + initial α (multi-citation only) firing. |
~22:40Z |
Symmetric-None skip emerges: dominant residual is |
~22:50Z |
Pattern α generalized: extend from “multi-citation sentences only” to “any multi-topic sentence” (any sentence matching ≥2 category keywords). 44 → 37. |
~22:55Z |
Decision: ship at 37 rather than over-engineer. Above the #82 ≤20 target but a 5× reduction; residual 37 includes real misalignments worth consumer-triage + edge cases requiring parser-level understanding. |
~23:00Z |
v1.4.0 ships — commit |
End-to-end from #82 filing to v1.4.0 ship: ~1.5 hours.
Dogfood evidence (audit_citation_alignment cycle)#
Release |
Configuration |
Warnings on consumer HEAD |
Reduction vs v1.3.0 baseline |
|---|---|---|---|
v1.3.0 (audit_citation_alignment with scope=’all’) |
188 |
— |
(baseline) |
v1.4.0 ( |
37 |
-80% |
The residual 37 is above the original #82 acceptance criterion (≤20) but represents a 5× reduction. Categorized:
Real misalignments consumer should triage (e.g., ADR-025 cited for a threshold claim when ADR-025 is the cost ADR; this could be a wrong-ADR bug or a multi-topic ADR not captured by the consumer’s category-keyword map).
Single-topic-sentence edge cases: Pattern α only fires for multi-topic sentences (≥2 category keywords matched). Sentences with one dominant topic + a cross-category ADR cite stay on the legacy first-match check.
Multi-topic ADRs where the consumer’s category map assigns a single category but the ADR genuinely covers multiple topics. Consumer-side category-map expansion would address these.
Round 15 outcome#
✅ v1.4.0 shipped (commit 7c40490, PyPI live).
✅ #82 auto-closed by closes #82 directive on the v1.4.0 tag.
✅ 0 open issues on the eval-toolkit repo post-#82.
✅ ADR 0007 committed: three-layer architecture (identity + scope + pairing) codified as canonical for ALL audit_* validators. ADRs 0005/0006 now framed as family-wide via 0007.
✅ _narrative.py extracted: shared narrative-prose helpers; both validators import from one canonical location. Signature-preserving refactor; all v1.3.0 tests pass unchanged.
⚠ Consumer HARD-gate promotion: still a judgment call. The residual 37 includes some real misalignments worth fixing in consumer prose; HARD-gating now would block commits on those lines. Consumer’s v1.3.8 bundled-promotion plan should proceed only after triaging the 37.
The post-v1.0 consumer-feedback cycle (R11 → R15)#
The five rounds form a complete library-first cycle for the audit-validator family:
| Round | Driver | Cycle time | Closure |
|---|---|---|---|
| R11 | Consumer adopts v1.0.x audit-validator family | days | v1.0.4 (3 validators shipped) |
| R12 | Consumer files #80 (BINDINGS slice-axis) | ~2h | v1.1.0 + v1.2.0 |
| R13 | v1.1.0 dogfood surfaces context-filter gaps | ~1h | v1.2.0 |
| R14 | Consumer files #81 (audit_value_bindings list-grammar) | ~1.5h | v1.3.0 (Layer 3) |
| R15 (this) | Consumer files #82 (audit_citation_alignment scope+pairing) | ~1.5h | v1.4.0 (Layer 2 + 3 + ADR 0007) |
R11→R15 demonstrates the iterative consumer-feedback model as a substitute for heavyweight multi-LLM gate audits (closed at v1.0 per ADR 0003). Each round delivered closure within hours of the consumer filing, with same-day adoption. ADR 0007 captures the architectural template; future audit validators inherit the three-layer model and ship in fewer cycles.
Multi-LLM audit cadence after R15#
Unchanged from R11–R14: no multi-LLM cross-review for
consumer-feedback rounds. R15’s design was informed by
/exploring-options (3 rounds) + 2 dogfood-driven refinements
(Pattern α generalization + symmetric-None skip). Multi-LLM
dispatch remains reserved for v2.0 design cycles or
major-severity findings.
Cross-references#
ADR 0007 — family-wide three-layer codification.
v1.4.0 CHANGELOG entry — full per-pattern detail with consumer adoption path.
tests/test_audit_citation_alignment.py: 18 tests (12 from v1.0.1 + 6 for v1.4.0 Patterns α/β/γ + backward-compat + shared-helpers + combined dogfood).