Migrating to v1.0
The v1.0 release is a stability-contract activation, not a code
delta from v0.51. v1.0 ships exactly the fixes that landed in v0.51;
what is new at v1.0 is that the
ADR 0003 Tier 1 / Tier 2 / Tier 3 stability contract
becomes load-bearing. Breaking changes to Tier-1 surfaces after v1.0
require a major bump (v2.0).
If you’re jumping from v0.50 or earlier, read v0.51.md
first — that is where the actual migration-step content lives. This
document recaps what’s locked at v1.0 and lists what’s deferred to
v1.0.1.
What v1.0 locks (ADR 0003 activation)
Per the stability contract, after v1.0:
Tier 1 STRICT: public-API signatures captured in tests/golden/public_api/snapshot.json. Any signature drift bumps to v2.0.
Tier 2 ADDITIVE: the 9 strict Protocols (Scorer, LeakageCheck, Splitter, ThresholdSelector, DatasetLoader, MetricSpec, MetaLearner, Probe, TextTransform) + 1 opt-in (Versioned). Method shapes are frozen; new subprotocols / new Protocols may be added.
Tier 3 FREE: internal modules (_rng, _sweep, _parallel, the various _*.py files). Refactors don’t need a major bump.
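To make the Tier 1 idea concrete, here is a minimal sketch of a signature-snapshot check. It is not the project's actual test, and the layout of snapshot.json is not described here; the helper names (capture_signatures, diff_signatures) and the toy score function are hypothetical and exist only for illustration.

```python
import inspect
import json


def capture_signatures(namespace, names):
    """Map each public name to the string form of its call signature."""
    return {name: str(inspect.signature(namespace[name])) for name in names}


def diff_signatures(snapshot, current):
    """Return the names whose signature is missing or changed relative to the snapshot."""
    return sorted(name for name, sig in snapshot.items() if current.get(name) != sig)


# Hypothetical public function as it looked when the snapshot was taken.
def score(y_true, y_pred, *, threshold=0.5):
    return None


# Serialise and reload, as a committed JSON snapshot file would be.
snapshot = json.loads(json.dumps(capture_signatures({"score": score}, ["score"])))


# The same public name after an edit that makes `threshold` positional.
def score_changed(y_true, y_pred, threshold=0.5):
    return None


current = capture_signatures({"score": score_changed}, ["score"])
drift = diff_signatures(snapshot, current)  # non-empty list means signature drift
```

In this sketch only removed or changed names count as drift; names that are newly added are ignored. Whether the real snapshot test treats additions the same way is an assumption.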
The four v1.0 gates (real consumer in production, Protocol-shape
review cycle, multi-model methodology cross-review, Croissant interop
end-to-end) are all closed — see
roadmap.md for the gate
ledger.
Round 8 + Round 9 audit closure (recap)
v1.0 ships with the Round 8 + Round 9 multi-LLM cross-review batch closed:
Round 8 (verified against v0.50.0): 13 confirmed findings → fixed in v0.51; 3 refuted (Gemini over-confidence pattern); 2 deferred to v1.x as Tier-2 additive (custom exceptions; joblib memory-aware capping).
Round 9 (verified against v0.51 RC): 6 confirmed of 10 source items + 3 third-audit findings my Claude verification caught in modules neither auditor cited (_sweep.py, bootstrap.py, metrics.py). 2 candidate-blocker-tier items fixed in this RC (F-sweep-1 NaN/inf scorer-output validation; F-bootstrap-1 BCa degeneracy warning + mde_from_ci NaN-width guard). 4 minors deferred to v1.0.1.
Full ledger at audit_findings.md Round 8 +
Round 9 sections.
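As an illustration of what "NaN/inf scorer-output validation" (F-sweep-1) means in practice, the sketch below rejects non-finite scores before they reach downstream aggregation. It is not the code in _sweep.py; the function name validate_scores and the error message are hypothetical.

```python
import math


def validate_scores(scores):
    """Return the scores as a list, or raise if any value is NaN or infinite."""
    bad = [i for i, s in enumerate(scores) if not math.isfinite(s)]
    if bad:
        # ValueError is used here because no custom exception hierarchy is assumed.
        raise ValueError(f"scorer returned non-finite values at positions {bad}")
    return list(scores)


clean = validate_scores([0.1, 0.7, 0.3])

try:
    validate_scores([0.1, float("nan"), float("inf")])
    rejected = False
except ValueError:
    rejected = True
```

Failing at the point where the scorer returns keeps a single bad value from silently turning a mean or a confidence interval into NaN later on.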
Carried-over deprecations
The R8-C1 DeprecationWarning on multi-seed evaluate_folded(seeds=...)
calls without an explicit reseed_splitter callback persists past
v1.0 by design. The pre-v1.0 deprecation window is one minor
(v0.51 → v1.0), while DEPRECATION.md requires ≥2 minors to close a
deprecation cycle, so the cycle cannot close at v1.0. The warning
therefore stays as a permanent docstring + runtime nudge: single-seed
callers see no change, and multi-seed callers should pass
reseed_splitter to get true seed variance.
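The toy below sketches why the warning exists. It is a stand-in, not eval-toolkit's evaluate_folded: the function evaluate_folded_sketch, the zero-argument splitter, and the (splitter, seed) callback signature are all assumptions made for illustration. The point is only that without re-seeding, every seed reuses the same folds.

```python
import random
import warnings


def make_folds(n_rows, n_folds, seed):
    """Shuffle row indices with `seed` and deal them into `n_folds` folds."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    return [sorted(idx[i::n_folds]) for i in range(n_folds)]


def evaluate_folded_sketch(splitter, seeds, reseed_splitter=None):
    """Hypothetical stand-in: returns the folds used for each seed."""
    if len(seeds) > 1 and reseed_splitter is None:
        warnings.warn(
            "multi-seed call without reseed_splitter: every seed reuses the same folds",
            DeprecationWarning,
            stacklevel=2,
        )
    runs = {}
    for seed in seeds:
        active = reseed_splitter(splitter, seed) if reseed_splitter else splitter
        runs[seed] = active()
    return runs


def fixed_splitter():
    """A splitter whose seed is baked in."""
    return make_folds(n_rows=10, n_folds=2, seed=0)


def reseed(splitter, seed):
    """Hypothetical callback: build a fresh splitter for the given seed."""
    return lambda: make_folds(n_rows=10, n_folds=2, seed=seed)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    same = evaluate_folded_sketch(fixed_splitter, seeds=[0, 1])

reseeded = evaluate_folded_sketch(fixed_splitter, seeds=[0, 1], reseed_splitter=reseed)
```

Here same[0] and same[1] contain identical row indices, so any spread across seeds cannot come from the split, while the reseeded run produces a different partition per seed.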
Deferred to v1.0.1
The following items are filed in the
v1.0.1 cleanup tracking issue
(gh issue list --label tracked --label improvement) and will be picked
up in the next minor. All are Tier-2 ADDITIVE or Tier-3 FREE, so
the v1.0 Tier-1 contract is not affected.
RC2: SimilarityStrategy contract reconciliation. Demoted in extending.md + README to “pre-v0.7 internal interface” but still pinned in __init__.py:__all__ and the public-API snapshot. Resolve to a single canonical Tier (3-internal vs 2-additive) and align all surfaces.
RC3: tests/test_harness_folded.py R8-C1 reseed_splitter regression-test hardening. The current count-only assertions should compare fold-row indices across seeds.
RC4: v0.51 documentation count ambiguity. The “13 confirmed / 3 refuted / 2 deferred” tallies appear with minor variance across audit_findings.md, migration/v0.51.md, and CHANGELOG headers. Reconcile to a single canonical tally.
F-metrics-1: brier_score docstring precision pass.
F-metrics-3: ECE behavior on uniform / uninformative scores. Clarify docs OR add boundary-condition validation.
F-metrics-4: brier_score single-class edge-case docstring ambiguity.
Carried forward from earlier rounds:
R8-G3: custom exception hierarchy beyond ValueError (additive Tier-2). Deferred until a downstream consumer requests it.
R8-G4: joblib memory-aware n_jobs capping. Caller-owned per documented hazard at _parallel.py; non-trivial mitigation (RAM measurement + DataFrame-size accounting), so deferred indefinitely.
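For context on the F-metrics-1 and F-metrics-4 items listed above, the textbook Brier score for binary outcomes is the mean squared difference between the predicted probability and the 0/1 label. The sketch below is that textbook formula only, not eval-toolkit's brier_score, and the name brier_score_sketch is hypothetical. It shows that the quantity stays well defined when every label belongs to one class, which is the edge case the docstring item concerns.

```python
def brier_score_sketch(probs, labels):
    """Mean squared difference between predicted probabilities and 0/1 labels."""
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and the same length")
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


# Both classes present.
mixed = brier_score_sketch([0.9, 0.2, 0.7], [1, 0, 1])

# Single-class input: every label is 1, and the score is still computable.
single_class = brier_score_sketch([0.9, 0.8], [1, 1])
```

How the library's own function documents or handles the single-class case is exactly what the deferred items are meant to settle, so nothing here should be read as its current behavior.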
What you should do on upgrade
If your consumer is on v0.51, nothing. v1.0 is bit-equivalent to v0.51 for all public behavior.
If your consumer is on v0.50 or earlier, follow v0.51.md
for the actual migration steps. The migration sequence is
v0.49 → v0.50 → v0.51 → v1.0; each step has its own guide.
If you depend on eval-toolkit in a downstream project, pin
>=1.0,<2.0 to opt into the stability contract. Tier-1 breakages
after v1.0 will land in v2.0.
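A downstream project that wants a runtime guard in addition to the install-time pin could check the installed version against the same range. The sketch below is a simplified, hypothetical helper (satisfies_pin) that only understands plain numeric release strings such as 1.2.3; real specifier matching follows PEP 440 and is better left to the packaging tooling.

```python
def satisfies_pin(version):
    """True if a plain numeric version string falls in the range >=1.0,<2.0."""
    parts = [int(p) for p in version.split(".")]  # raises ValueError on suffixes like "rc1"
    major, minor = (parts + [0])[:2]              # treat "1" as "1.0"
    return (1, 0) <= (major, minor) < (2, 0)


# The installed version could be obtained with
# importlib.metadata.version("eval-toolkit"), assuming that is the
# distribution name; fixed strings are used here to keep the sketch self-contained.
checks = {v: satisfies_pin(v) for v in ["0.51.0", "1.0.0", "1.4.2", "2.0.0"]}
```

Only the 1.x entries pass, which matches the intent of the pin: accept every release covered by the v1.0 contract and reject the next major.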
See also
roadmap.md: v1.0 gates ledger + post-v1.0 forward-look.
audit_findings.md: full Round 5 → Round 9 audit history.
adr/0003-stability-contract-and-gate3-methodology.md: the contract this tag locks in.
MIGRATION.md: version-to-version migration index.