ADR 0003: v1.0 stability contract + Gate 3 methodology

Status: Accepted
Date: 2026-05-21 (drafted at v0.46-prep; finalized at v0.48)
Deciders: Brandon Behring (author), Round 5 / Round 6 / Round 7 audits (Codex + Gemini)
Supersedes: N/A.
Superseded by: N/A.

Context

v1.0 is eval-toolkit’s first stability commitment release. Two questions must be answered precisely or the commitment is unspecified:

  1. Stability contract scope. What exactly is frozen at v1.0? The maximally-strict reading — every public symbol, every signature, every docstring first line — taxes ongoing maintenance heavily. The maximally-loose reading — only top-level __all__ — gives consumers a weak guarantee and forces submodule users to absorb refactoring as surprise breaking changes.

  2. Gate 3 methodology evidence. The original v1.0 plan called for “methodology docs peer-reviewed by an external reader (statistics / methodology background, ideally not part of the prompt_injection_* core team).” For a single-author / single-consumer / one-month release window, identifying an external academic reviewer with the right expertise is high-variance calendar work. The plan-time alternative — multi-model LLM cross-review (author + Codex + Gemini) — is operationally cheaper but not the same evidence class.

Round 5 audit (Codex F7) flagged the second issue specifically: the plan as originally drafted understated the difference between external human peer review and model-assisted cross-review.

Decision

Two interlinked decisions, finalized here:

1. Tiered stability contract (Decision M from the v1.0 plan)

v1.0 commits to three tiers of stability, each with explicit SemVer-version cost-to-change:

Tier 1: STRICT (SemVer-major to change)

The following are part of the v1.0 strict contract and require a v2.0 bump to alter:

  • Symbols in eval_toolkit.__all__ (i.e., everything in _EXPORTS at v1.0 tag time). Examples: scorecard, Scorecard, MetricSpec, MetricResult, bootstrap_ci, BootstrapCI, evaluate, RunManifest, LogisticStacker, MetaLearner, etc.

  • Their signatures — parameter names, defaults, kwarg-only markers, return types. (Amended 2026-06-10, #101/#93: a strictly appended optional parameter — a new keyword-only kwarg with a default, or a new trailing dataclass field with a default that does not perturb positional construction, equality/hash semantics, or serialized schemas — is SemVer-MINOR, provided the public-API snapshot is regenerated in the same commit so the change is reviewed, not accidental. Any other signature change — renames, removals, reorderings, default changes, return-type changes, non-trailing or non-defaulted field insertions — remains SemVer-major. This codifies the practice already shipped at v1.5.0 (HFDatasetsLoader kwargs; recorded as erratum #101) and v1.9.0 (#93 return_samples / BootstrapCI.samples). The original text routed every signature change to “the additive-Protocol path or accept the major bump”, which provides no path at all for functions and dataclasses.)

  • Tier-2 Protocols (9 strict at v1.0, all shipped through v0.47.0; 10 strict since v1.0.2 — see the SimilarityStrategy entry below):

    • Scorer (from protocols.py)

    • LeakageCheck (from leakage.py)

    • Splitter (from splits.py)

    • ThresholdSelector (from thresholds.py)

    • DatasetLoader (from loaders.py)

    • MetricSpec (from scorecards.py since v0.49.0; was _scorecard.py v0.46–v0.48)

    • TextTransform (from protocols.py, shipped v0.47)

    • MetaLearner (from stacking.py, shipped v0.45)

    • Probe (from probes.py, shipped v0.43)

    • SimilarityStrategy (from text_dedup.py; promoted to the 10th strict Protocol at v1.0.2 per the #76 RC2 reconciliation — a documentation-only promotion, the contract was already pinned in the public-API snapshot since v1.0.0. Amended here 2026-06-09.)

    • PLUS 1 opt-in Protocol: Versioned (additive on top of Tier-2).

    Tier-2 Protocol method signatures are pinned by the Decision R6-D drift guard in tests/test_public_api.py (since v0.47.0): the golden snapshot captures each Protocol’s method signatures + return types via typing.get_type_hints + inspect.signature, so a SemVer-major review fires on any shape change.

  • Current versioned JSON schemas (src/eval_toolkit/schemas/*.json) per artifact type: manifest.v3.json (canonical), manifest.v1.json + manifest.v2.json (historical, kept for migration), ood_manifest.v1.json, results.v1.json, results_full.v1.json. Schema-version bumps within the v1.x line are additive only (e.g., publishing manifest.v4.json is fine; renaming required fields in manifest.v3.json is not).
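The amended signature rule in the list above can be approximated mechanically for plain functions. The following is a minimal sketch under stated assumptions, not the check that lives in tests/test_public_api.py: classify_signature_change and the bootstrap_* functions are hypothetical names invented for illustration, and dataclass fields, equality/hash semantics, and serialized schemas are out of scope.

```python
import inspect
from typing import Callable


def classify_signature_change(old: Callable, new: Callable) -> str:
    """Return 'none', 'minor', or 'major' for a function signature change.

    'minor' covers only the strictly-appended case: every old parameter is
    unchanged and every added parameter is keyword-only with a default.
    """
    old_sig = inspect.signature(old)
    new_sig = inspect.signature(new)

    if old_sig == new_sig:
        return "none"
    if old_sig.return_annotation != new_sig.return_annotation:
        return "major"

    old_params = list(old_sig.parameters.values())
    new_params = list(new_sig.parameters.values())

    # Existing parameters must survive with the same name, kind, default,
    # annotation, and relative order (Parameter equality compares all four).
    kept = [p for p in new_params if p.name in old_sig.parameters]
    if kept != old_params:
        return "major"

    # Every added parameter must be keyword-only and carry a default.
    added = [p for p in new_params if p.name not in old_sig.parameters]
    for param in added:
        if param.kind is not inspect.Parameter.KEYWORD_ONLY:
            return "major"
        if param.default is inspect.Parameter.empty:
            return "major"
    return "minor"


# Hypothetical stand-ins for three versions of one public function.
def bootstrap_v1(y_true, y_score, *, n_resamples=1000): ...
def bootstrap_v2(y_true, y_score, *, n_resamples=1000, return_samples=False): ...
def bootstrap_renamed(y_true, scores, *, n_resamples=1000): ...


print(classify_signature_change(bootstrap_v1, bootstrap_v2))       # minor
print(classify_signature_change(bootstrap_v1, bootstrap_renamed))  # major
```

A classifier like this only narrows the review; the amendment still requires the public-API snapshot to be regenerated in the same commit so that a human sees the change.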

Tier 2: ADDITIVE-ONLY (SemVer-minor for changes)

The following may grow in v1.x minors but may not lose existing functionality:

  • Submodule public symbols (e.g., eval_toolkit.metrics.pr_auc, eval_toolkit.bootstrap.paired_bootstrap_diff, eval_toolkit.calibration.fit_platt_calibrator). These are documented as internal API per ADR 0002: they are not part of Tier 1’s strict freeze, but eval-toolkit commits to not removing them during v1.x. A refactor that adds new helpers and keeps old paths importable is acceptable.

  • Tier-2 Protocols can gain optional methods via subprotocols — e.g., a future MetricSpec subprotocol could add is_defined_on(y_true) -> tuple[bool, str] for richer cell-state detection. Implementations that don’t implement the subprotocol continue to satisfy the base Protocol; this is additive.

  • Amendment (2026-06-12, v1.12.0): Tier-2 surfaces may change beyond strictly-additive (parameter renames; eventual removals) in a SemVer-MINOR only through the deprecation process documented in docs/source/DEPRECATION.md — the standard 2-minor warning window, or its one-time-exception criteria (small verified consumer set, cross-repo notification, alias-debt justification, announce-time documentation). The symbol itself must remain importable through v1.x. First use: the eval_toolkit.eda SPEC-7 parameter renames at v1.12.0 (#100, exception path).
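The subprotocol route described in the list above can be sketched with typing.Protocol. This is an illustration only: BaseMetric, DefinednessAwareMetric, Accuracy, GuardedAccuracy, and describe are hypothetical names, and the compute method is an assumed shape rather than the real MetricSpec surface. Only the is_defined_on(y_true) -> tuple[bool, str] signature comes from the text.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class BaseMetric(Protocol):
    """Hypothetical stand-in for a frozen base Protocol."""

    def compute(self, y_true: list[int], y_score: list[float]) -> float: ...


@runtime_checkable
class DefinednessAwareMetric(BaseMetric, Protocol):
    """Additive subprotocol: the base Protocol above is left untouched."""

    def is_defined_on(self, y_true: list[int]) -> tuple[bool, str]: ...


class Accuracy:
    """Implements only the base Protocol."""

    def compute(self, y_true, y_score):
        hits = sum(int(s >= 0.5) == t for t, s in zip(y_true, y_score))
        return hits / len(y_true)


class GuardedAccuracy(Accuracy):
    """Opts in to the subprotocol by adding the optional method."""

    def is_defined_on(self, y_true):
        if not y_true:
            return False, "empty labels"
        return True, ""


def describe(metric: BaseMetric, y_true: list[int]) -> str:
    # Callers feature-detect the subprotocol instead of requiring it.
    if isinstance(metric, DefinednessAwareMetric):
        ok, reason = metric.is_defined_on(y_true)
        return "defined" if ok else f"undefined: {reason}"
    return "unknown (base Protocol only)"


print(describe(Accuracy(), [1, 0]))      # unknown (base Protocol only)
print(describe(GuardedAccuracy(), []))   # undefined: empty labels
```

Because Accuracy still satisfies BaseMetric after the subprotocol is introduced, existing implementations keep working, which is what makes the change additive.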

Tier 3: FREE (SemVer-patch)

The following may change in any release:

  • Docstring first lines. tests/test_public_api.py captures them as part of the public-API snapshot. (Amended 2026-06-09: the v1.0 plan to drop this capture, or gate it behind a STRICT_DOCSTRINGS=1 env var, was never implemented. Through v1.x, docstring first lines remain pinned, so docstring polish requires a snapshot regen in the same commit; the SemVer tier of such a change is still Tier-3/patch.)

  • Implementation internals (helpers without __all__ entries, private _-prefixed symbols, module-level constants not in any __all__). These may move, rename, or disappear without a SemVer signal.

  • Error message wording. The error TYPES (ValueError vs RuntimeError) are stable; the human-readable message text is not.
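For consumers, the error-message bullet above means tests should assert on the exception type and not on its text. A small sketch, assuming a hypothetical validator (parse_confidence_level is not an eval-toolkit function):

```python
def parse_confidence_level(value: float) -> float:
    """Hypothetical validator, used only to illustrate the contract."""
    if not 0.0 < value < 1.0:
        # The wording below is Tier 3: it may change in any release.
        raise ValueError(f"confidence level must be in (0, 1), got {value!r}")
    return value


def raises_value_error(value: float) -> bool:
    """Check the stable part of the contract: the exception TYPE."""
    try:
        parse_confidence_level(value)
    except ValueError:
        return True
    return False


print(raises_value_error(1.5))   # True
print(raises_value_error(0.95))  # False
```

With pytest, the equivalent is pytest.raises(ValueError) without a match= pattern; matching on message text would couple the test to a surface this ADR leaves free to change.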

2. Gate 3 methodology cross-review

Gate 3 at v1.0 is internal model-assisted cross-review, NOT external academic peer review. The process:

  1. Manual review by the author. The author reads docs/source/methodology/ (16 chapters), the new feature pages, ADRs 0001 / 0002 / 0003 (this file), and the v1.0 plan with a critic’s eye. The author is NOT an external reviewer; this is the highest-context but weakest-independence of the three reads.

  2. Codex independent report. Provide the methodology curriculum + plan + relevant source files; ask for a methodology-focused review identifying gaps, contradictions, unstated assumptions, and statistical-correctness issues. Codex’s training corpus differs from Anthropic’s, so it surfaces issues Anthropic-trained reasoning may miss.

  3. Gemini independent report. Same packet, different model. Google’s training corpus differs from both Anthropic’s and OpenAI’s, so it provides a third independent read.

Honest framing (per Audit F7): this is not equivalent to external academic peer review.

  • What multi-LLM cross-review catches well (demonstrated in Round 5): plan/code contradictions, references to symbols that don’t exist, doc-code drift, load-bearing instruction errors in implementation directives, mathematical claims that don’t match implementation, public-API status contradictions. Round 5 surfaced 12 verified-real findings before any v0.46 code shipped.

  • What multi-LLM cross-review catches less reliably: methodological judgments that depend on domain expertise outside the models’ training corpora, deep statistical correctness on novel methods, whether the chosen methodology serves the practitioner’s actual decision-making need.

  • What it does NOT substitute for: external accountability. There is no third party who will be embarrassed if v1.0 turns out to have a methodology bug.

Cycle structure (per Decision Y.2):

  • Round 5 (pre-v0.46 implementation): complete 2026-05-21. Surfaced 12 verified-real findings, including F1 / F2 (scorecard shape blockers) and F7 (Gate 3 honesty, this ADR’s framing).

  • Round 6 (post-v0.46 ship): complete 2026-05-21. STOP-GATE CLOSED before v0.47 opened. Codex R6-F1 (ECE strategy validation) + Codex R6-F2 (deprecation warning content) shipped as v0.46.1 hotfix per Decision R6-E; the remaining findings (R6-A through R6-H) folded into v0.47.0.

  • Round 7 (post-v0.47 ship): complete 2026-05-21. STOP-GATE CLOSED before v0.48 opened. Codex R7-F1 (MyST-NB doc-execution gap) + R7-F2 (sweep strategy_id) + R7-F3 (sweep scorer output shape) folded into v0.48.0; Gemini observations folded into v0.48 documentation polish (§5K) + the Makefile pre-push target (§5L).

  • Round 8 (post-v0.48 ship): final pre-v1.0 packet review.

  • v1.0 tag requires Round 8 to close (or fall back per ADR amendment).

Findings ledger: docs/source/audit_findings.md records each round + the disposition of every blocker-severity finding. Blocker findings also get a p1-gate3-labelled GitHub issue for fix-tracking.

Escalation path (if a Round flags methodological judgment beyond LLM capability): consult a human reviewer for the specific narrow question. Don’t require a full-curriculum external read; do require a human signoff on any methodology claim that surfaces as “models disagree” or “models flag uncertainty without confidence.”
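The escalation rule above amounts to a small decision function. The sketch below is one possible reading, not a tool that exists in the repository: ModelRead, its verdict labels, and needs_human_signoff are all hypothetical, and "models flag uncertainty without confidence" is simplified here to a single "uncertain" verdict.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRead:
    """One model's read of a single methodology claim (hypothetical record)."""

    model: str
    verdict: str  # assumed labels: "sound", "flawed", or "uncertain"


def needs_human_signoff(reads: list[ModelRead]) -> bool:
    """Escalate when the models disagree or any model is uncertain."""
    verdicts = {read.verdict for read in reads}
    models_disagree = len(verdicts) > 1
    any_uncertain = "uncertain" in verdicts
    return models_disagree or any_uncertain


reads = [ModelRead("codex", "sound"), ModelRead("gemini", "flawed")]
print(needs_human_signoff(reads))  # True
```

An agreed "flawed" verdict does not escalate under this reading; it would go to the findings ledger as an ordinary finding, and only the narrow disputed or uncertain claims reach a human reviewer.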

Consequences

Positive:

  • The tiered contract gives consumers a clear answer to “what can I rely on?” — Tier 1 is bedrock; Tier 2 is documented-as-additive; Tier 3 is understood-to-evolve.

  • Submodule users (the eval_toolkit.metrics.* escape hatch) get an explicit additive-only commitment for v1.x — not the maximally-loose “no commitment” of a narrow contract.

  • Gate 3 honesty: future maintainers can read this ADR and understand exactly what evidence the v1.0 release carries (and doesn’t).

  • Rounds 5 / 6 / 7 demonstrated that multi-LLM cross-review catches real issues at each stage (Round 5: 12 findings, Round 6: 11 findings, Round 7: 3 substantive Codex findings + 6 Gemini observations); the cycle structure ensures coverage of every breaking minor before the final stability tag.

Negative:

  • The Tier 1 commitment to Tier-2 Protocol shapes makes future Protocol-evolution work harder. v1.x can ADD subprotocols but cannot MODIFY base method signatures without v2.0.

  • The tiered contract is more documentation surface to maintain than a flat “everything is strict” or “only __all__ is strict” contract.

  • Multi-LLM cross-review has no formal external-accountability story. If a v1.0 methodology bug surfaces post-release, no third party shares responsibility for missing it.
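The first negative consequence, frozen Protocol shapes, is enforced by a golden snapshot of method signatures. The sketch below shows the general idea using inspect.signature and typing.get_type_hints; it is not the code in tests/test_public_api.py. The Scorer class here is a hypothetical stand-in with an assumed score method, and snapshot_protocol is an invented helper name.

```python
import inspect
import typing
from typing import Protocol


class Scorer(Protocol):
    """Hypothetical stand-in; the real Scorer lives in protocols.py."""

    def score(self, texts: list[str]) -> list[float]: ...


def snapshot_protocol(proto: type) -> dict[str, dict[str, str]]:
    """Capture each public method's signature and resolved return type."""
    snapshot: dict[str, dict[str, str]] = {}
    for name, member in vars(proto).items():
        # Skip dunder/private machinery and anything that is not a method.
        if name.startswith("_") or not inspect.isfunction(member):
            continue
        hints = typing.get_type_hints(member)
        snapshot[name] = {
            "signature": str(inspect.signature(member)),
            "returns": repr(hints.get("return")),
        }
    return snapshot


# A golden copy would normally be stored on disk and reviewed when it changes.
golden = snapshot_protocol(Scorer)
drifted = snapshot_protocol(Scorer) != golden
print(sorted(golden), drifted)
```

Any edit to a base method, such as a renamed parameter or a changed return type, makes the fresh snapshot differ from the golden one, which is the signal for a SemVer-major review.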

Alternatives considered

Maximally-strict contract (every public symbol, signature, docstring)

Rejected because the docstring-first-line capture would tax every documentation-polish PR with a SemVer concern. The test snapshot would trigger on every doc improvement.

Maximally-loose contract (only top-level __all__)

Rejected because the submodule escape hatch is real and used. Without a commitment, consumer code such as from eval_toolkit.metrics import expected_calibration_error could break without SemVer warning, defeating the consumer’s purpose in pinning to a stable version.

Strict + flat (no tiering, freeze submodule too)

Rejected because the v1.x line needs room to grow. The 4-binary-calibrator family added across v0.35 → v0.42 was exactly the kind of additive submodule evolution the flat-strict contract would have penalized.

External academic peer review as required Gate 3

Rejected because the single-author, one-month release window made reviewer-identification calendar risk a hard blocker. The Round 5 audit demonstrated that multi-LLM cross-review surfaces real findings (12 verified) quickly; the trade-off (no external accountability) is documented here rather than overclaimed.

Defer Gate 3 entirely (self-review only)

Rejected because Round 5 demonstrated that even at the author’s level of familiarity with the plan, Codex + Gemini each caught issues the author missed. Removing the cross-review step would surrender that benefit.

Trigger to revisit

This ADR is locked at v1.0. Revisiting requires SemVer-major (v2.0) or explicit ADR amendment. Likely triggers:

  • Second production consumer with materially different access patterns (e.g., heavy submodule use that exposes additive-only ambiguity).

  • A v1.x methodology bug that multi-LLM cross-review missed but external review would have caught. In that case, re-tighten Gate 3 for v2.0+ to require human review for specific claim types.

  • Schema bumps beyond v3 (manifest schema is currently at v3). If v1.x publishes manifest.v4.json as canonical, this ADR’s schema-list is amended (additive change to Tier 1 — does NOT require v2.0).
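The schema rule referenced in the last trigger (publishing a new versioned file is additive; renaming required fields in an existing file is not) can be checked with a small comparison. The sketch below rests on assumptions: frozen_schema_violations is a hypothetical helper, the field names are invented, and treating a newly required field as a violation is an inference from the additive-only rule, not something this ADR states.

```python
def frozen_schema_violations(frozen: dict, edited: dict) -> list[str]:
    """Compare the tagged copy of one schema file against an edited copy."""
    frozen_required = set(frozen.get("required", []))
    edited_required = set(edited.get("required", []))
    edited_properties = set(edited.get("properties", {}))

    problems = []
    # A required field that vanished was either removed or renamed.
    for field in sorted(frozen_required - edited_properties):
        problems.append(f"required field removed or renamed: {field}")
    # Assumption: a newly required field would reject existing artifacts.
    for field in sorted(edited_required - frozen_required):
        problems.append(f"new required field: {field}")
    return problems


# Invented field names, for illustration only.
frozen = {"required": ["run_id"], "properties": {"run_id": {}, "notes": {}}}
additive = {"required": ["run_id"],
            "properties": {"run_id": {}, "notes": {}, "tags": {}}}
renamed = {"required": ["run_identifier"],
           "properties": {"run_identifier": {}, "notes": {}}}

print(frozen_schema_violations(frozen, additive))  # []
print(frozen_schema_violations(frozen, renamed))
```

Publishing manifest.v4.json as a new file never touches the frozen copy of manifest.v3.json, so it passes a check of this kind trivially.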

References

  • v1.0 plan: ~/.claude/plans/evaluate-all-the-work-twinkly-kite.md (Decisions M + O).

  • Round 5 audit ledger: audit_findings.md — Audit F7 (Gate 3 honesty) is the key finding for this ADR.

  • ADR 0001 — flat module layout — the module-layout contract that operates inside the Tier-1 stability commitment defined here.

  • ADR 0002 — scorecard as primary metric surface — adds MetricSpec to Tier-2 Protocols list above; demotes eval_toolkit.metrics.* to the Tier-2 additive-only commitment.

  • tests/test_public_api.py — captures docstring first lines; the planned v1.0 skip was not implemented (see the amended Tier-3 note above).