# ADR 0003: v1.0 stability contract + Gate 3 methodology

**Status:** Accepted
**Date:** 2026-05-21 (drafted at v0.46-prep; finalized at v0.48)
**Deciders:** Brandon Behring (author), Round 5 / Round 6 / Round 7 audits
(Codex + Gemini)
**Supersedes:** N/A. **Superseded by:** N/A.

## Context

v1.0 is eval-toolkit's first **stability commitment** release. Two questions
must be answered precisely or the commitment is unspecified:

1. **Stability contract scope.** What exactly is frozen at v1.0? The
   maximally-strict reading — every public symbol, every signature,
   every docstring first line — taxes ongoing maintenance heavily.
   The maximally-loose reading — only top-level `__all__` — gives consumers
   a weak guarantee and forces submodule users to absorb refactoring as
   surprise breaking changes.

2. **Gate 3 methodology evidence.** The original v1.0 plan called for
   "methodology docs peer-reviewed by an external reader (statistics /
   methodology background, ideally not part of the `prompt_injection_*`
   core team)." For a single-author / single-consumer / one-month
   release window, identifying an external academic reviewer with the
   right expertise is high-variance calendar work. The plan-time
   alternative — multi-model LLM cross-review (author + Codex + Gemini)
   — is operationally cheaper but not the same evidence class.

Round 5 audit (Codex F7) flagged the second issue specifically: the plan
as originally drafted understated the difference between external human
peer review and model-assisted cross-review.

## Decision

Two interlinked decisions, finalized here:

### 1. Tiered stability contract (Decision M from the v1.0 plan)

v1.0 commits to **three tiers** of stability, each with explicit
SemVer-version cost-to-change:

#### Tier 1: STRICT (SemVer-major to change)

The following are part of the v1.0 strict contract and require a v2.0
bump to alter:

- **Symbols in `eval_toolkit.__all__`** (i.e., everything in `_EXPORTS`
  at v1.0 tag time). Examples: `scorecard`, `Scorecard`, `MetricSpec`,
  `MetricResult`, `bootstrap_ci`, `BootstrapCI`, `evaluate`,
  `RunManifest`, `LogisticStacker`, `MetaLearner`, etc.
- **Their signatures** — parameter names, defaults, kwarg-only markers,
  return types. (Amended 2026-06-10, #101/#93: a **strictly appended
  optional parameter** — a new keyword-only kwarg with a default, or a
  new trailing dataclass field with a default that does not perturb
  positional construction, equality/hash semantics, or serialized
  schemas — is **SemVer-MINOR**, provided the public-API snapshot is
  regenerated in the same commit so the change is reviewed, not
  accidental. Any other signature change — renames, removals,
  reorderings, default changes, return-type changes, non-trailing or
  non-defaulted field insertions — remains SemVer-major. This codifies
  the practice already shipped at v1.5.0 (`HFDatasetsLoader` kwargs;
  recorded as erratum #101) and v1.9.0 (#93 `return_samples` /
  `BootstrapCI.samples`). The original text routed every signature
  change to "the additive-Protocol path or accept the major bump",
  which provides no path at all for functions and dataclasses.)
- **Tier-2 Protocols** (9 strict at v1.0, all shipped through v0.47.0;
  10 strict since v1.0.2 — see the `SimilarityStrategy` entry below):
  - `Scorer` (from `protocols.py`)
  - `LeakageCheck` (from `leakage.py`)
  - `Splitter` (from `splits.py`)
  - `ThresholdSelector` (from `thresholds.py`)
  - `DatasetLoader` (from `loaders.py`)
  - `MetricSpec` (from `scorecards.py` since v0.49.0; was `_scorecard.py` v0.46–v0.48)
  - `TextTransform` (from `protocols.py`, shipped v0.47)
  - `MetaLearner` (from `stacking.py`, shipped v0.45)
  - `Probe` (from `probes.py`, shipped v0.43)
  - `SimilarityStrategy` (from `text_dedup.py`; promoted to the 10th
    strict Protocol at v1.0.2 per the #76 RC2 reconciliation — a
    documentation-only promotion, the contract was already pinned in
    the public-API snapshot since v1.0.0. Amended here 2026-06-09.)
  - PLUS 1 opt-in Protocol: `Versioned` (additive on top of Tier-2).

  Tier-2 Protocol method signatures are pinned by the Decision R6-D drift
  guard in `tests/test_public_api.py` (since v0.47.0): the golden
  snapshot captures each Protocol's method signatures + return types via
  `typing.get_type_hints` + `inspect.signature`, so a SemVer-major
  review fires on any shape change.
- **Current versioned JSON schemas** (`src/eval_toolkit/schemas/*.json`)
  per artifact type — `manifest.v3.json` (canonical), `manifest.v1.json`
  + `manifest.v2.json` (historical, kept for migration),
  `ood_manifest.v1.json`, `results.v1.json`, `results_full.v1.json`.
  Schema-version bumps within the v1.x line are additive only (e.g.,
  publishing `manifest.v4.json` is fine; renaming required fields in
  `manifest.v3.json` is not).

#### Tier 2: ADDITIVE-ONLY (SemVer-minor for changes)

The following may grow in v1.x minors but may not lose existing
functionality:

- **Submodule public symbols** (e.g., `eval_toolkit.metrics.pr_auc`,
  `eval_toolkit.bootstrap.paired_bootstrap_diff`,
  `eval_toolkit.calibration.fit_platt_calibrator`). These are
  documented as **internal API** per ADR 0002 — not part of Tier 1's
  strict freeze, but eval-toolkit commits to **not removing** them
  during v1.x. Refactor that adds new helpers + keeps old paths
  importable is acceptable.
- **Tier-2 Protocols can gain optional methods via subprotocols** —
  e.g., a future `MetricSpec` subprotocol could add
  `is_defined_on(y_true) -> tuple[bool, str]` for richer cell-state
  detection. Implementations that don't implement the subprotocol
  continue to satisfy the base Protocol; this is additive.
- **Amendment (2026-06-12, v1.12.0):** Tier-2 surfaces may change
  beyond strictly-additive (parameter renames; eventual removals) in a
  SemVer-MINOR **only** through the deprecation process documented in
  `docs/source/DEPRECATION.md` — the standard 2-minor warning window,
  or its one-time-exception criteria (small verified consumer set,
  cross-repo notification, alias-debt justification, announce-time
  documentation). The symbol itself must remain importable through
  v1.x. First use: the `eval_toolkit.eda` SPEC-7 parameter renames at
  v1.12.0 (#100, exception path).

#### Tier 3: FREE (SemVer-patch)

The following may change in any release:

- **Docstring first lines** — `tests/test_public_api.py` captures
  them as part of the public-API snapshot. (Amended 2026-06-09: the
  v1.0 plan to drop this capture — or gate it behind a
  `STRICT_DOCSTRINGS=1` env var — was never implemented. Through
  v1.x, docstring first lines remain pinned, so docstring polish
  requires a snapshot regen in the same commit; the SemVer *tier* of
  such a change is still Tier-3/patch.)
- **Implementation internals** (helpers without `__all__` entries,
  private `_-prefixed` symbols, module-level constants not in any
  `__all__`). These may move, rename, or disappear without a SemVer
  signal.
- **Error message wording**. The error TYPES (ValueError vs RuntimeError)
  are stable; the human-readable message text is not.

### 2. Gate 3 methodology cross-review

Gate 3 at v1.0 is **internal model-assisted cross-review**, **NOT**
external academic peer review. The process:

1. **Manual review by the author.** Author reads
   `docs/source/methodology/` (16 chapters) + new feature pages +
   ADRs 0001 / 0002 / 0003 / this file + the v1.0 plan with a critic's
   eye. The author is NOT an external reviewer; this is the
   highest-context but weakest-independence of the three reads.
2. **Codex independent report.** Provide the methodology curriculum +
   plan + relevant source files; ask for a methodology-focused review
   identifying gaps, contradictions, unstated assumptions, and
   statistical-correctness issues. Codex's training corpus differs from
   Anthropic's, so it surfaces issues Anthropic-trained reasoning may
   miss.
3. **Gemini independent report.** Same packet, different model. Google's
   training corpus differs from both Anthropic's and OpenAI's, so it
   provides a third independent read.

**Honest framing** (per Audit F7): this is **not** equivalent to external
academic peer review.

- **What multi-LLM cross-review catches well** (demonstrated in Round 5):
  plan/code contradictions, references to symbols that don't exist,
  doc-code drift, load-bearing instruction errors in implementation
  directives, mathematical claims that don't match implementation,
  public-API status contradictions. Round 5 surfaced 12 verified-real
  findings before any v0.46 code shipped.
- **What multi-LLM cross-review catches less reliably**: methodological
  judgments that depend on domain expertise outside the models' training
  corpora, deep statistical correctness on novel methods, whether the
  chosen methodology serves the practitioner's actual decision-making
  need.
- **What it does NOT substitute for**: external accountability. There is
  no third party who will be embarrassed if v1.0 turns out to have a
  methodology bug.

**Cycle structure** (per Decision Y.2):

- Round 5 (pre-v0.46 implementation) — **complete** 2026-05-21.
  Surfaced 12 verified-real findings, including F1 / F2 (scorecard
  shape blockers) and F7 (Gate 3 honesty — this ADR's framing).
- Round 6 (post-v0.46 ship) — **complete** 2026-05-21. STOP-GATE
  CLOSED before v0.47 opened. Codex R6-F1 (ECE strategy validation)
  + Codex R6-F2 (deprecation warning content) shipped as v0.46.1
  hotfix per Decision R6-E; the remaining findings (R6-A through R6-H)
  folded into v0.47.0.
- Round 7 (post-v0.47 ship) — **complete** 2026-05-21. STOP-GATE
  CLOSED before v0.48 opened. Codex R7-F1 (MyST-NB doc-execution gap)
  + R7-F2 (sweep strategy_id) + R7-F3 (sweep scorer output shape)
  folded into v0.48.0; Gemini observations folded into v0.48
  documentation polish (§5K) + the Makefile pre-push target (§5L).
- Round 8 (post-v0.48 ship) — final pre-v1.0 packet review.
- v1.0 tag requires Round 8 to close (or fall back per ADR amendment).

**Findings ledger**: [`docs/source/audit_findings.md`](../audit_findings.md)
records each round + the disposition of every blocker-severity finding.
Blocker findings also get a `p1-gate3`-labelled GitHub issue for
fix-tracking.

**Escalation path** (if a Round flags methodological judgment beyond
LLM capability): consult a human reviewer for the specific narrow
question. Don't require a full-curriculum external read; do require a
human signoff on any methodology claim that surfaces as "models disagree"
or "models flag uncertainty without confidence."

## Consequences

**Positive:**

- The tiered contract gives consumers a clear answer to "what can I rely
  on?" — Tier 1 is bedrock; Tier 2 is documented-as-additive; Tier 3 is
  understood-to-evolve.
- Submodule users (the `eval_toolkit.metrics.*` escape hatch) get an
  explicit additive-only commitment for v1.x — not the maximally-loose
  "no commitment" of a narrow contract.
- Gate 3 honesty: future maintainers can read this ADR and understand
  exactly what evidence the v1.0 release carries (and doesn't).
- Rounds 5 / 6 / 7 demonstrated that multi-LLM cross-review catches
  real issues at each stage (Round 5: 12 findings, Round 6: 11 findings,
  Round 7: 3 substantive Codex findings + 6 Gemini observations); the
  cycle structure ensures coverage of every breaking minor before the
  final stability tag.

**Negative:**

- The Tier 1 commitment to Tier-2 Protocol shapes makes future
  Protocol-evolution work harder. v1.x can ADD subprotocols but cannot
  MODIFY base method signatures without v2.0.
- The tiered contract is more documentation surface to maintain than a
  flat "everything is strict" or "only `__all__` is strict" contract.
- Multi-LLM cross-review has no formal external-accountability story.
  If a v1.0 methodology bug surfaces post-release, no third party shares
  responsibility for missing it.

## Alternatives considered

### Maximally-strict contract (every public symbol, signature, docstring)

Reject because the docstring-first-line capture would tax every
documentation-polish PR with a SemVer concern. Test snapshot would
trigger on every doc improvement.

### Maximally-loose contract (only top-level `__all__`)

Reject because the submodule escape hatch is real and used. Without a
commitment, consumer code at `from eval_toolkit.metrics import
expected_calibration_error` could break without SemVer warning, defeating
the consumer's purpose for pinning to a stable version.

### Strict + flat (no tiering, freeze submodule too)

Reject because the v1.x line needs room to grow. The 4-binary-calibrator
family added across v0.35 → v0.42 was exactly the kind of additive
submodule evolution the flat-strict contract would have penalized.

### External academic peer review as required Gate 3

Reject because the single-author, one-month release window made
reviewer-identification calendar risk a hard blocker. The Round 5 audit
demonstrated multi-LLM cross-review surfaces real findings (12 verified)
quickly; the trade-off (no external accountability) is documented here
rather than overclaimed.

### Defer Gate 3 entirely (self-review only)

Reject because Round 5 demonstrated that even at the author's level of
familiarity with the plan, Codex + Gemini each caught issues the author
missed. Removing the cross-review step would surrender that benefit.

## Trigger to revisit

This ADR is locked at v1.0. Revisiting requires **SemVer-major (v2.0)** or
explicit ADR amendment. Likely triggers:

- **Second production consumer** with materially different access
  patterns (e.g., heavy submodule use that exposes additive-only
  ambiguity).
- **A v1.x methodology bug** that multi-LLM cross-review missed but
  external review would have caught. In that case, re-tighten Gate 3
  for v2.0+ to require human review for specific claim types.
- **Schema bumps** beyond v3 (manifest schema is currently at v3). If
  v1.x publishes `manifest.v4.json` as canonical, this ADR's schema-list
  is amended (additive change to Tier 1 — does NOT require v2.0).

## References

- v1.0 plan: `~/.claude/plans/evaluate-all-the-work-twinkly-kite.md`
  (Decisions M + O).
- Round 5 audit ledger: [`audit_findings.md`](../audit_findings.md) —
  Audit F7 (Gate 3 honesty) is the key finding for this ADR.
- [ADR 0001 — flat module layout](0001-flat-module-layout.md) — the
  module-layout contract that operates inside the Tier-1 stability
  commitment defined here.
- [ADR 0002 — scorecard as primary metric surface](0002-scorecard-as-primary-metric-surface.md)
  — adds `MetricSpec` to Tier-2 Protocols list above; demotes
  `eval_toolkit.metrics.*` to the Tier-2 additive-only commitment.
- `tests/test_public_api.py` — captures docstring first lines; the
  planned v1.0 skip was not implemented (see the amended Tier-3 note
  above).