Migrating to v1.0

The v1.0 release activates a stability contract; it is not a code delta from v0.51. v1.0 ships exactly the fixes that landed in v0.51. What is new is that the ADR 0003 Tier 1 / Tier 2 / Tier 3 stability contract becomes load-bearing: breaking changes to Tier-1 surfaces after v1.0 require a major bump (v2.0).

If you’re jumping from v0.50 or earlier, read v0.51.md first — that is where the actual migration-step content lives. This document recaps what’s locked at v1.0 and lists what’s deferred to v1.0.1.

What v1.0 locks (ADR 0003 activation)

Per the stability contract, after v1.0:

  • Tier 1 STRICT — public-API signatures captured in tests/golden/public_api/snapshot.json. Any signature drift bumps to v2.0.

  • Tier 2 ADDITIVE — the 9 strict Protocols (Scorer, LeakageCheck, Splitter, ThresholdSelector, DatasetLoader, MetricSpec, MetaLearner, Probe, TextTransform) + 1 opt-in (Versioned). Method shapes are frozen; new subprotocols / new Protocols may be added.

  • Tier 3 FREE — internal modules (_rng, _sweep, _parallel, the various _*.py files). Refactors don’t need a major bump.
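
To make the Tier 1 rule concrete, here is a minimal sketch of how a signature snapshot check could work. It is not the project's actual test: the helper names (capture_signatures, diff_against_snapshot) and the example function score are hypothetical, and the real layout of tests/golden/public_api/snapshot.json is not shown in this guide. The sketch also assumes that only changed or removed signatures count as drift; how newly added names are treated is a policy choice it does not model.

```python
import inspect
import json


def capture_signatures(namespace, names):
    """Map each public name to the string form of its call signature."""
    return {name: str(inspect.signature(namespace[name])) for name in sorted(names)}


def diff_against_snapshot(current, snapshot):
    """Return every snapshot entry whose signature changed or disappeared."""
    drift = {}
    for name, expected in snapshot.items():
        actual = current.get(name)
        if actual != expected:
            drift[name] = {"expected": expected, "actual": actual}
    return drift


# Hypothetical public function standing in for a Tier-1 surface.
def score(y_true, y_pred, *, threshold=0.5):
    hits = sum(int((p >= threshold) == bool(t)) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


# Same function after a keyword rename: a breaking change under a strict contract.
def score_renamed(y_true, y_pred, *, cutoff=0.5):
    return score(y_true, y_pred, threshold=cutoff)


# Round-trip through JSON to mimic a snapshot stored on disk.
snapshot = json.loads(json.dumps(capture_signatures({"score": score}, ["score"])))

unchanged = diff_against_snapshot(capture_signatures({"score": score}, ["score"]), snapshot)
drift = diff_against_snapshot(capture_signatures({"score": score_renamed}, ["score"]), snapshot)
```

Under these assumptions, unchanged is empty and drift reports the renamed keyword, which is the kind of difference that would force a v2.0 bump.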

The four v1.0 gates (real consumer in production, Protocol-shape review cycle, multi-model methodology cross-review, Croissant interop end-to-end) are all closed — see roadmap.md for the gate ledger.

Round 8 + Round 9 audit closure (recap)

v1.0 ships with the Round 8 + Round 9 multi-LLM cross-review batch closed:

  • Round 8 (verified against v0.50.0): 13 confirmed findings → fixed in v0.51; 3 refuted (Gemini over-confidence pattern); 2 deferred to v1.x as Tier-2 additive (custom exceptions; joblib memory-aware capping).

  • Round 9 (verified against v0.51 RC): 6 of 10 source items confirmed, plus 3 third-audit findings that the Claude verification pass caught in modules neither auditor cited (_sweep.py, bootstrap.py, metrics.py). 2 candidate-blocker-tier items were fixed in this RC (F-sweep-1: NaN/inf scorer-output validation; F-bootstrap-1: BCa degeneracy warning + mde_from_ci NaN-width guard). 4 minors deferred to v1.0.1.

The full ledger is in the Round 8 and Round 9 sections of audit_findings.md.
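
The two fixes named in the Round 9 item both guard against non-finite numbers. The sketch below illustrates that idea only; it is not the code in _sweep.py or bootstrap.py, and the names validate_scores and ci_width_or_none are hypothetical.

```python
import math


def validate_scores(scores, *, source="scorer"):
    """Return scores as floats, raising ValueError on the first NaN or infinity."""
    checked = []
    for index, raw in enumerate(scores):
        value = float(raw)
        if not math.isfinite(value):
            raise ValueError(
                f"{source} returned non-finite score {value!r} at index {index}"
            )
        checked.append(value)
    return checked


def ci_width_or_none(lower, upper):
    """Return the interval width, or None when it is NaN or infinite."""
    width = upper - lower
    return width if math.isfinite(width) else None
```

Failing early on a non-finite scorer output keeps a single bad value from silently propagating into aggregates; returning None for an unusable interval width is one possible convention, chosen here only for illustration.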

Carried-over deprecations

The R8-C1 DeprecationWarning on multi-seed evaluate_folded(seeds=...) calls without an explicit reseed_splitter callback persists past v1.0 by design. The pre-v1.0 deprecation window is one minor (v0.51 → v1.0); DEPRECATION.md requires ≥2 minors to close a deprecation cycle. The warning therefore becomes a permanent docstring + runtime nudge — single-seed callers see no change; multi-seed callers should pass reseed_splitter for true seed variance.
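
The behaviour described above can be sketched with a stand-in function. This is not the library's evaluate_folded: its real signature is not shown in this guide, so the parameters below (rows, make_folds) and the assumption that reseed_splitter takes a seed and returns a splitter are all hypothetical.

```python
import random
import warnings


def evaluate_folded_sketch(rows, seeds, make_folds, reseed_splitter=None):
    """Hypothetical stand-in that returns the folds used for each seed."""
    if len(seeds) > 1 and reseed_splitter is None:
        warnings.warn(
            "multi-seed call without reseed_splitter: every seed reuses the same folds",
            DeprecationWarning,
            stacklevel=2,
        )
    results = {}
    for seed in seeds:
        # Assumption: the callback maps a seed to a freshly seeded splitter.
        splitter = reseed_splitter(seed) if reseed_splitter is not None else make_folds
        results[seed] = splitter(rows)
    return results


def fixed_folds(rows, n_folds=2):
    """Deterministic splitter: fold membership never depends on a seed."""
    return [rows[i::n_folds] for i in range(n_folds)]


def seeded_splitter(seed, n_folds=2):
    """Build a splitter that shuffles rows with its own seeded RNG."""
    def split(rows):
        shuffled = list(rows)
        random.Random(seed).shuffle(shuffled)
        return [shuffled[i::n_folds] for i in range(n_folds)]
    return split


rows = list(range(10))
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    same = evaluate_folded_sketch(rows, seeds=[0, 1], make_folds=fixed_folds)
    varied = evaluate_folded_sketch(
        rows, seeds=[0, 1], make_folds=fixed_folds, reseed_splitter=seeded_splitter
    )
```

In this sketch only the first call warns, and only the second call produces different folds per seed, which is the "true seed variance" the deprecation message is nudging multi-seed callers toward.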

Deferred to v1.0.1

The following items are filed in the v1.0.1 cleanup tracking issue (gh issue list --label tracked --label improvement) and will be picked up in the next patch release. All are Tier-2 ADDITIVE or Tier-3 FREE, so the v1.0 Tier-1 contract is not affected.

  • RC2: SimilarityStrategy contract reconciliation. It was demoted in extending.md + README to “pre-v0.7 internal interface” but is still pinned in __init__.py:__all__ and the public-API snapshot. Resolve to a single canonical Tier (3-internal vs 2-additive) and align all surfaces.

  • RC3: harden the R8-C1 reseed_splitter regression test in tests/test_harness_folded.py. The current count-only assertions should instead compare fold-row indices across seeds.

  • RC4 — v0.51 documentation count ambiguity (“13 confirmed / 3 refuted / 2 deferred” tallies appear with minor variance across audit_findings.md, migration/v0.51.md, and CHANGELOG headers). Reconcile to a single canonical tally.

  • F-metrics-1: brier_score docstring precision pass.

  • F-metrics-3 — ECE behavior on uniform / uninformative scores: clarify docs OR add boundary-condition validation.

  • F-metrics-4: brier_score single-class edge-case docstring ambiguity.
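
The RC3 item above turns on the difference between counting folds and comparing their contents. The following sketch shows why count-only assertions are too weak; fold_indices and folds_differ are hypothetical helpers, not code from tests/test_harness_folded.py.

```python
import random


def fold_indices(n_rows, n_folds, seed):
    """Hypothetical seeded splitter: one sorted tuple of row indices per fold."""
    order = list(range(n_rows))
    random.Random(seed).shuffle(order)
    return tuple(tuple(sorted(order[i::n_folds])) for i in range(n_folds))


def folds_differ(folds_a, folds_b):
    """True when the two splits partition the rows differently."""
    return set(map(frozenset, folds_a)) != set(map(frozenset, folds_b))


seed_0 = fold_indices(20, 4, seed=0)
seed_1 = fold_indices(20, 4, seed=1)
not_reseeded = fold_indices(20, 4, seed=0)  # mimics a splitter that ignored the new seed

# Fold sizes are identical in every case, so a count-only assertion cannot
# tell a correctly reseeded split from one that silently reused seed 0.
sizes_match = (
    [len(f) for f in seed_0]
    == [len(f) for f in seed_1]
    == [len(f) for f in not_reseeded]
)

# Comparing row indices separates the two cases.
reseeded_ok = folds_differ(seed_0, seed_1)
regression_caught = not folds_differ(seed_0, not_reseeded)
```

Comparing folds as sets of row indices also makes the check insensitive to the order in which folds are returned, which is a design choice of this sketch rather than a stated requirement.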

Carried forward from earlier rounds:

  • R8-G3 — custom exception hierarchy beyond ValueError (additive Tier-2). Deferred until a downstream consumer requests it.

  • R8-G4 — joblib memory-aware n_jobs capping. Caller-owned per documented hazard at _parallel.py; non-trivial mitigation (RAM measurement + DataFrame-size accounting) — deferred indefinitely.

What you should do on upgrade

If your consumer is on v0.51, nothing. v1.0 is bit-equivalent to v0.51 for all public behavior.

If your consumer is on v0.50 or earlier, follow v0.51.md for the actual migration steps. The migration sequence is v0.49 → v0.50 → v0.51 → v1.0; each step has its own guide.

If you depend on eval-toolkit in a downstream project, pin >=1.0,<2.0 to opt into the stability contract. Tier-1 breakages after v1.0 will land in v2.0.
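
As a rough illustration of what the >=1.0,<2.0 range admits, the sketch below checks plain numeric version strings against it. The function names are hypothetical, and pre-release or local version suffixes are deliberately rejected rather than interpreted; for real dependency resolution a full PEP 440 implementation such as the packaging library's SpecifierSet is the safer choice.

```python
import re


def release_tuple(version):
    """Parse a plain numeric version such as '1.0.1' into a tuple of ints."""
    if not re.fullmatch(r"\d+(\.\d+)*", version):
        raise ValueError(f"unsupported version string: {version!r}")
    parts = tuple(int(p) for p in version.split("."))
    # Pad so that '1' compares equal to '1.0'.
    return parts + (0,) * max(0, 2 - len(parts))


def within_v1_contract(version):
    """True when version falls inside the range >=1.0,<2.0."""
    return (1, 0) <= release_tuple(version) < (2, 0)


checks = {v: within_v1_contract(v) for v in ["0.51.0", "1.0", "1.0.1", "1.7.3", "2.0.0"]}
```

Every 1.x release falls inside the range, while 0.51 and 2.0 fall outside it, matching the statement that Tier-1 breakages land in v2.0.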

See also