ADR 0003: v1.0 stability contract + Gate 3 methodology#
Status: Accepted
Date: 2026-05-21 (drafted at v0.46-prep; finalized at v0.48)
Deciders: Brandon Behring (author), Round 5 / Round 6 / Round 7 audits (Codex + Gemini)
Supersedes: N/A.
Superseded by: N/A.
Context#
v1.0 is eval-toolkit’s first stability commitment release. Two questions must be answered precisely or the commitment is unspecified:
Stability contract scope. What exactly is frozen at v1.0? The maximally-strict reading (every public symbol, every signature, every docstring first line) taxes ongoing maintenance heavily. The maximally-loose reading (only top-level __all__) gives consumers a weak guarantee and forces submodule users to absorb refactoring as surprise breaking changes.
Gate 3 methodology evidence. The original v1.0 plan called for “methodology docs peer-reviewed by an external reader (statistics / methodology background, ideally not part of the prompt_injection_* core team).” For a single-author / single-consumer / one-month release window, identifying an external academic reviewer with the right expertise is high-variance calendar work. The plan-time alternative, multi-model LLM cross-review (author + Codex + Gemini), is operationally cheaper but not the same evidence class.
Round 5 audit (Codex F7) flagged the second issue specifically: the plan as originally drafted understated the difference between external human peer review and model-assisted cross-review.
Decision#
Two interlinked decisions, finalized here:
1. Tiered stability contract (Decision M from the v1.0 plan)#
v1.0 commits to three tiers of stability, each with an explicit SemVer cost to change:
Tier 1: STRICT (SemVer-major to change)#
The following are part of the v1.0 strict contract and require a v2.0 bump to alter:
Symbols in eval_toolkit.__all__ (i.e., everything in _EXPORTS at v1.0 tag time). Examples: scorecard, Scorecard, MetricSpec, MetricResult, bootstrap_ci, BootstrapCI, evaluate, RunManifest, LogisticStacker, MetaLearner, etc.
Their signatures: parameter names, defaults, kwarg-only markers, return types. (Amended 2026-06-10, #101/#93: a strictly appended optional parameter (a new keyword-only kwarg with a default, or a new trailing dataclass field with a default that does not perturb positional construction, equality/hash semantics, or serialized schemas) is SemVer-MINOR, provided the public-API snapshot is regenerated in the same commit so the change is reviewed, not accidental. Any other signature change (renames, removals, reorderings, default changes, return-type changes, non-trailing or non-defaulted field insertions) remains SemVer-major. This codifies the practice already shipped at v1.5.0 (HFDatasetsLoader kwargs; recorded as erratum #101) and v1.9.0 (#93 return_samples / BootstrapCI.samples). The original text routed every signature change to “the additive-Protocol path or accept the major bump”, which provides no path at all for functions and dataclasses.)
Tier-2 Protocols (9 strict at v1.0, all shipped through v0.47.0; 10 strict since v1.0.2, see the SimilarityStrategy entry below):
  Scorer (from protocols.py)
  LeakageCheck (from leakage.py)
  Splitter (from splits.py)
  ThresholdSelector (from thresholds.py)
  DatasetLoader (from loaders.py)
  MetricSpec (from scorecards.py since v0.49.0; was _scorecard.py v0.46–v0.48)
  TextTransform (from protocols.py, shipped v0.47)
  MetaLearner (from stacking.py, shipped v0.45)
  Probe (from probes.py, shipped v0.43)
  SimilarityStrategy (from text_dedup.py; promoted to the 10th strict Protocol at v1.0.2 per the #76 RC2 reconciliation. This was a documentation-only promotion: the contract was already pinned in the public-API snapshot since v1.0.0. Amended here 2026-06-09.)
PLUS 1 opt-in Protocol: Versioned (additive on top of Tier-2).
Tier-2 Protocol method signatures are pinned by the Decision R6-D drift guard in tests/test_public_api.py (since v0.47.0): the golden snapshot captures each Protocol’s method signatures + return types via typing.get_type_hints + inspect.signature, so a SemVer-major review fires on any shape change.
Current versioned JSON schemas (src/eval_toolkit/schemas/*.json) per artifact type: manifest.v3.json (canonical), manifest.v1.json and manifest.v2.json (historical, kept for migration), ood_manifest.v1.json, results.v1.json, results_full.v1.json. Schema-version bumps within the v1.x line are additive only (e.g., publishing manifest.v4.json is fine; renaming required fields in manifest.v3.json is not).
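The amended signature rule can be read as a small decision procedure. The sketch below is illustrative only and is not the drift guard in tests/test_public_api.py: classify_signature_change and the resample_* functions are hypothetical names, and only the function half of the rule is modelled (the trailing-dataclass-field clause is left out).

```python
import inspect


def classify_signature_change(old, new) -> str:
    """Return "none", "minor", or "major" for a change from `old` to `new`."""
    old_sig, new_sig = inspect.signature(old), inspect.signature(new)
    if old_sig == new_sig:
        return "none"
    if old_sig.return_annotation != new_sig.return_annotation:
        return "major"  # return-type change

    old_params, new_params = old_sig.parameters, new_sig.parameters
    # Every original parameter must survive with the same name, kind,
    # default, and annotation (Parameter equality compares all four).
    for name, param in old_params.items():
        if new_params.get(name) != param:
            return "major"  # rename, removal, or default change
    # Original parameters must keep their relative order.
    if [n for n in new_params if n in old_params] != list(old_params):
        return "major"  # reordering
    # Whatever is new must be keyword-only and carry a default.
    for name, param in new_params.items():
        if name in old_params:
            continue
        if param.kind is not inspect.Parameter.KEYWORD_ONLY:
            return "major"
        if param.default is inspect.Parameter.empty:
            return "major"
    return "minor"


# Hypothetical API versions, used only to exercise the classifier.
def resample_v1(values, *, n_resamples=1000): ...
def resample_appended(values, *, n_resamples=1000, return_samples=False): ...
def resample_renamed(values, *, n_boot=1000): ...
def resample_new_default(values, *, n_resamples=2000): ...
def resample_positional(values, return_samples=False, *, n_resamples=1000): ...


print(classify_signature_change(resample_v1, resample_appended))     # minor
print(classify_signature_change(resample_v1, resample_renamed))      # major
print(classify_signature_change(resample_v1, resample_new_default))  # major
print(classify_signature_change(resample_v1, resample_positional))   # major
```

A "minor" verdict here only says the change is eligible for a minor release; per the amendment, the public-API snapshot still has to be regenerated in the same commit.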
Tier 2: ADDITIVE-ONLY (SemVer-minor for changes)#
The following may grow in v1.x minors but may not lose existing functionality:
Submodule public symbols (e.g., eval_toolkit.metrics.pr_auc, eval_toolkit.bootstrap.paired_bootstrap_diff, eval_toolkit.calibration.fit_platt_calibrator). These are documented as internal API per ADR 0002: not part of Tier 1’s strict freeze, but eval-toolkit commits to not removing them during v1.x. A refactor that adds new helpers and keeps old paths importable is acceptable.
Tier-2 Protocols can gain optional methods via subprotocols. For example, a future MetricSpec subprotocol could add is_defined_on(y_true) -> tuple[bool, str] for richer cell-state detection. Implementations that don’t implement the subprotocol continue to satisfy the base Protocol; this is additive.
Amendment (2026-06-12, v1.12.0): Tier-2 surfaces may change beyond strictly-additive (parameter renames; eventual removals) in a SemVer-MINOR only through the deprecation process documented in docs/source/DEPRECATION.md: the standard 2-minor warning window, or its one-time-exception criteria (small verified consumer set, cross-repo notification, alias-debt justification, announce-time documentation). The symbol itself must remain importable through v1.x. First use: the eval_toolkit.eda SPEC-7 parameter renames at v1.12.0 (#100, exception path).
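The subprotocol route described in this tier can be sketched with typing.Protocol. BaseMetric and DefinednessAware below are hypothetical stand-ins (the real MetricSpec method set is not reproduced here); only the is_defined_on(y_true) -> tuple[bool, str] shape comes from the text above.

```python
from typing import Protocol, Sequence, runtime_checkable


@runtime_checkable
class BaseMetric(Protocol):
    """Hypothetical base Protocol standing in for a frozen shape."""

    def compute(self, y_true: Sequence[int], y_score: Sequence[float]) -> float: ...


@runtime_checkable
class DefinednessAware(BaseMetric, Protocol):
    """Additive subprotocol: everything in BaseMetric plus one extra method."""

    def is_defined_on(self, y_true: Sequence[int]) -> tuple[bool, str]: ...


class MeanScore:
    """Pre-existing implementation: satisfies only the base Protocol."""

    def compute(self, y_true, y_score):
        return sum(y_score) / len(y_score)


class PositiveRate:
    """Newer implementation that opts in to the subprotocol."""

    def compute(self, y_true, y_score):
        return sum(y_true) / len(y_true)

    def is_defined_on(self, y_true):
        if len(y_true) == 0:
            return False, "no labels"
        return True, ""


def evaluate_cell(metric: BaseMetric, y_true, y_score):
    # Callers probe for the richer shape and fall back to the base contract.
    if isinstance(metric, DefinednessAware):
        ok, reason = metric.is_defined_on(y_true)
        if not ok:
            return None, reason
    return metric.compute(y_true, y_score), ""


print(isinstance(MeanScore(), BaseMetric))                 # True
print(isinstance(MeanScore(), DefinednessAware))           # False
print(evaluate_cell(PositiveRate(), [], []))               # (None, 'no labels')
print(evaluate_cell(MeanScore(), [1, 0], [0.75, 0.25]))    # (0.5, '')
```

Because the base Protocol is untouched, MeanScore keeps working unchanged, which is what makes the change additive.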
Tier 3: FREE (SemVer-patch)#
The following may change in any release:
Docstring first lines. tests/test_public_api.py captures them as part of the public-API snapshot. (Amended 2026-06-09: the v1.0 plan to drop this capture, or gate it behind a STRICT_DOCSTRINGS=1 env var, was never implemented. Through v1.x, docstring first lines remain pinned, so docstring polish requires a snapshot regen in the same commit; the SemVer tier of such a change is still Tier-3/patch.)
Implementation internals (helpers without __all__ entries, private _-prefixed symbols, module-level constants not in any __all__). These may move, rename, or disappear without a SemVer signal.
Error message wording. The error TYPES (ValueError vs RuntimeError) are stable; the human-readable message text is not.
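To see why docstring polish still needs a snapshot regen, consider a snapshot entry that records both the signature and the docstring's first line. This is a simplified sketch, not the actual contents of tests/test_public_api.py; snapshot_entry and both metric_* functions are hypothetical.

```python
import inspect


def snapshot_entry(obj) -> dict:
    """Record what a public-API snapshot might pin for one callable."""
    doc = inspect.getdoc(obj) or ""
    first_line = doc.splitlines()[0] if doc else ""
    return {"signature": str(inspect.signature(obj)), "doc": first_line}


def metric_before(y_true, y_score):
    """Compute the metric."""


def metric_after(y_true, y_score):
    """Compute the metric from labels and scores.

    Only the first line above is captured by the snapshot.
    """


before = snapshot_entry(metric_before)
after = snapshot_entry(metric_after)

print(before["signature"] == after["signature"])  # True: no signature change
print(before == after)                            # False: snapshot must be regenerated
```

The signature is identical, so the change stays Tier-3/patch, yet the pinned first line differs and the stored snapshot no longer matches.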
2. Gate 3 methodology cross-review#
Gate 3 at v1.0 is internal model-assisted cross-review, NOT external academic peer review. The process:
Manual review by the author. The author reads docs/source/methodology/ (16 chapters) + new feature pages + ADRs 0001 / 0002 / 0003 (this file) + the v1.0 plan with a critic’s eye. The author is NOT an external reviewer; this is the highest-context but weakest-independence of the three reads.
Codex independent report. Provide the methodology curriculum + plan + relevant source files; ask for a methodology-focused review identifying gaps, contradictions, unstated assumptions, and statistical-correctness issues. Codex’s training corpus differs from Anthropic’s, so it surfaces issues Anthropic-trained reasoning may miss.
Gemini independent report. Same packet, different model. Google’s training corpus differs from both Anthropic’s and OpenAI’s, so it provides a third independent read.
Honest framing (per Audit F7): this is not equivalent to external academic peer review.
What multi-LLM cross-review catches well (demonstrated in Round 5): plan/code contradictions, references to symbols that don’t exist, doc-code drift, load-bearing instruction errors in implementation directives, mathematical claims that don’t match implementation, public-API status contradictions. Round 5 surfaced 12 verified-real findings before any v0.46 code shipped.
What multi-LLM cross-review catches less reliably: methodological judgments that depend on domain expertise outside the models’ training corpora, deep statistical correctness on novel methods, whether the chosen methodology serves the practitioner’s actual decision-making need.
What it does NOT substitute for: external accountability. There is no third party who will be embarrassed if v1.0 turns out to have a methodology bug.
Cycle structure (per Decision Y.2):
Round 5 (pre-v0.46 implementation): complete 2026-05-21. Surfaced 12 verified-real findings, including F1 / F2 (scorecard shape blockers) and F7 (Gate 3 honesty, which is this ADR’s framing).
Round 6 (post-v0.46 ship): complete 2026-05-21. STOP-GATE CLOSED before v0.47 opened. Codex R6-F1 (ECE strategy validation) and Codex R6-F2 (deprecation warning content) shipped as v0.46.1 hotfix per Decision R6-E; the remaining findings (R6-A through R6-H) folded into v0.47.0.
Round 7 (post-v0.47 ship): complete 2026-05-21. STOP-GATE CLOSED before v0.48 opened. Codex R7-F1 (MyST-NB doc-execution gap), R7-F2 (sweep strategy_id), and R7-F3 (sweep scorer output shape) folded into v0.48.0; Gemini observations folded into v0.48 documentation polish (§5K) + the Makefile pre-push target (§5L).
Round 8 (post-v0.48 ship): final pre-v1.0 packet review.
v1.0 tag requires Round 8 to close (or fall back per ADR amendment).
Findings ledger: docs/source/audit_findings.md records each round + the disposition of every blocker-severity finding. Blocker findings also get a p1-gate3-labelled GitHub issue for fix-tracking.
Escalation path (if a Round flags methodological judgment beyond LLM capability): consult a human reviewer for the specific narrow question. Don’t require a full-curriculum external read; do require a human signoff on any methodology claim that surfaces as “models disagree” or “models flag uncertainty without confidence.”
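The escalation rule amounts to a predicate over the model reads of a single methodology claim. The sketch below is one possible encoding and is not part of eval-toolkit: the Read record and its fields are hypothetical, and “uncertainty without confidence” is modelled as a reviewer marking its own verdict as not confident.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Read:
    """One model's verdict on a single methodology claim (hypothetical record)."""

    reviewer: str    # e.g. "codex" or "gemini"
    sound: bool      # does the reviewer judge the claim sound?
    confident: bool  # did the reviewer commit to that verdict?


def needs_human_signoff(reads: list[Read]) -> bool:
    """True when the models disagree or any of them is unsure."""
    models_disagree = len({r.sound for r in reads}) > 1
    any_unsure = any(not r.confident for r in reads)
    return models_disagree or any_unsure


agree = [Read("codex", True, True), Read("gemini", True, True)]
split = [Read("codex", True, True), Read("gemini", False, True)]
hedged = [Read("codex", True, True), Read("gemini", True, False)]

print(needs_human_signoff(agree))   # False
print(needs_human_signoff(split))   # True
print(needs_human_signoff(hedged))  # True
```

Only the flagged claim goes to a human reviewer; the rest of the curriculum stays on the model-assisted path.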
Consequences#
Positive:
The tiered contract gives consumers a clear answer to “what can I rely on?” — Tier 1 is bedrock; Tier 2 is documented-as-additive; Tier 3 is understood-to-evolve.
Submodule users (the eval_toolkit.metrics.* escape hatch) get an explicit additive-only commitment for v1.x, not the maximally-loose “no commitment” of a narrow contract.
Gate 3 honesty: future maintainers can read this ADR and understand exactly what evidence the v1.0 release carries (and doesn’t).
Rounds 5 / 6 / 7 demonstrated that multi-LLM cross-review catches real issues at each stage (Round 5: 12 findings, Round 6: 11 findings, Round 7: 3 substantive Codex findings + 6 Gemini observations); the cycle structure ensures coverage of every breaking minor before the final stability tag.
Negative:
The Tier 1 commitment to Tier-2 Protocol shapes makes future Protocol-evolution work harder. v1.x can ADD subprotocols but cannot MODIFY base method signatures without v2.0.
The tiered contract is more documentation surface to maintain than a flat “everything is strict” or “only __all__ is strict” contract.
Multi-LLM cross-review has no formal external-accountability story. If a v1.0 methodology bug surfaces post-release, no third party shares responsibility for missing it.
Alternatives considered#
Maximally-strict contract (every public symbol, signature, docstring)#
Reject because the docstring-first-line capture would tax every documentation-polish PR with a SemVer concern. The test snapshot would trigger on every doc improvement.
Maximally-loose contract (only top-level __all__)#
Reject because the submodule escape hatch is real and used. Without a commitment, consumer code such as from eval_toolkit.metrics import expected_calibration_error could break without a SemVer warning, defeating the consumer’s purpose for pinning to a stable version.
Strict + flat (no tiering, freeze submodule too)#
Reject because the v1.x line needs room to grow. The 4-binary-calibrator family added across v0.35 → v0.42 was exactly the kind of additive submodule evolution the flat-strict contract would have penalized.
External academic peer review as required Gate 3#
Reject because the single-author, one-month release window made reviewer-identification calendar risk a hard blocker. The Round 5 audit demonstrated multi-LLM cross-review surfaces real findings (12 verified) quickly; the trade-off (no external accountability) is documented here rather than overclaimed.
Defer Gate 3 entirely (self-review only)#
Reject because Round 5 demonstrated that even at the author’s level of familiarity with the plan, Codex + Gemini each caught issues the author missed. Removing the cross-review step would surrender that benefit.
Trigger to revisit#
This ADR is locked at v1.0. Revisiting requires SemVer-major (v2.0) or explicit ADR amendment. Likely triggers:
Second production consumer with materially different access patterns (e.g., heavy submodule use that exposes additive-only ambiguity).
A v1.x methodology bug that multi-LLM cross-review missed but external review would have caught. In that case, re-tighten Gate 3 for v2.0+ to require human review for specific claim types.
Schema bumps beyond v3 (manifest schema is currently at v3). If v1.x publishes manifest.v4.json as canonical, this ADR’s schema-list is amended (additive change to Tier 1; does NOT require v2.0).
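The additive-only schema rule behind the last trigger (a new manifest.v4.json is fine, renaming required fields in manifest.v3.json is not) can be checked mechanically. A minimal sketch, assuming the schemas are JSON Schema documents with a top-level "required" array; the function name and the field names run_id, created_at, timestamp, and git_sha are hypothetical and not taken from the real schema files.

```python
def renamed_or_removed_required(tagged: dict, current: dict) -> dict:
    """Map schema file name -> required fields that vanished since the tag."""
    problems = {}
    for name, old_schema in tagged.items():
        # A schema file that no longer exists loses all of its required fields.
        new_schema = current.get(name, {})
        missing = set(old_schema.get("required", [])) - set(new_schema.get("required", []))
        if missing:
            problems[name] = sorted(missing)
    return problems


# Hypothetical snapshot of the schema directory at v1.0 tag time.
tagged = {"manifest.v3.json": {"required": ["run_id", "created_at"]}}

# Additive: v3 untouched, a new v4 file published alongside it.
additive = {**tagged, "manifest.v4.json": {"required": ["run_id", "created_at", "git_sha"]}}

# Breaking: a required field renamed inside the existing v3 file.
breaking = {"manifest.v3.json": {"required": ["run_id", "timestamp"]}}

print(renamed_or_removed_required(tagged, additive))  # {}
print(renamed_or_removed_required(tagged, breaking))  # {'manifest.v3.json': ['created_at']}
```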
References#
v1.0 plan: ~/.claude/plans/evaluate-all-the-work-twinkly-kite.md (Decisions M + O).
Round 5 audit ledger: audit_findings.md. Audit F7 (Gate 3 honesty) is the key finding for this ADR.
ADR 0001 (flat module layout): the module-layout contract that operates inside the Tier-1 stability commitment defined here.
ADR 0002 (scorecard as primary metric surface): adds MetricSpec to Tier-2 Protocols list above; demotes eval_toolkit.metrics.* to the Tier-2 additive-only commitment.
tests/test_public_api.py: captures docstring first lines; the planned v1.0 skip was not implemented (see the amended Tier-3 note above).