Migrating to v0.51

The v0.51 release is the Round 8 audit rectification batch: it addresses all 18 findings from the multi-LLM cross-review of v0.50.0 in one BREAKING-allowed minor before v1.0 is tagged. The audit verification report audit-verification-codex-gemini-v0.50.0.md (a repo-root, gitignored local record; not committed, so no link) confirmed 13 of the 18 claims, refuted 3 (R8-G2, R8-G5, and the paired R8-V1+R8-V2), and deferred 2 (R8-G3, R8-G4) to v1.x as Tier-2 additive changes. v0.51 ships fixes for the 13 confirmed claims; the deferred items are filed in the Round 8 ledger of audit_findings.md.

Round 9 multi-LLM cross-review verifies the v0.51 RC before v1.0 tags. If Round 9 lands clean, v1.0 follows v0.51 directly.

If you’re jumping from v0.49 (or earlier) and have not migrated through v0.50, read migration/v0.50.md first.

What’s BREAKING at v0.51

1. R8-C3: `recall_at_fpr` fallback sentinel

When no threshold satisfies target_fpr, the fallback now returns the sentinel RecallAtFprResult(threshold=np.inf, recall=0.0, actual_fpr=0.0, fp=0, tn=n_val_neg). Pre-v0.51, the fallback returned threshold=1.0 and then computed y_pred = (y_score >= 1.0), which silently classified every negative-class sample with a score of exactly 1.0 as positive. In that case actual_fpr could come back as high as 1.0, violating the function’s target FPR ceiling.

Before v0.51 (buggy):

recall_at_fpr(y=[0,1], scores=[1.0,1.0], target_fpr=0.0)
# returned actual_fpr=1.0, fp=1 (violates ceiling)

v0.51 (correct):

recall_at_fpr(y=[0,1], scores=[1.0,1.0], target_fpr=0.0)
# returns RecallAtFprResult(threshold=np.inf, actual_fpr=0.0, fp=0, ...)
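The difference between the two fallbacks can be checked by hand with plain NumPy. The sketch below does not call recall_at_fpr; it only reproduces the y_score >= threshold comparison described above, and the helper name false_positives is hypothetical.

```python
import numpy as np

y_true = np.array([0, 1])
y_score = np.array([1.0, 1.0])


def false_positives(threshold):
    """Count negatives predicted positive at a threshold (hypothetical helper)."""
    y_pred = y_score >= threshold
    return int(np.sum(y_pred & (y_true == 0)))


fp_old = false_positives(1.0)     # pre-v0.51 fallback threshold
fp_new = false_positives(np.inf)  # v0.51 sentinel threshold
print(fp_old, fp_new)
```

No finite score is greater than or equal to np.inf, so the sentinel threshold predicts every sample as negative.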

Migration: any caller filtering on result.threshold should add an np.isinf(result.threshold) branch — pre-v0.51 the unsatisfiable signal was threshold=1.0.
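A minimal sketch of such a branch follows. The RecallAtFprResult dataclass here is a stand-in built from the field names on this page, and usable_threshold is a hypothetical caller-side helper; the library's actual class may carry more fields.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class RecallAtFprResult:
    """Stand-in with the fields named on this page (assumption, not the real class)."""
    threshold: float
    recall: float
    actual_fpr: float
    fp: int
    tn: int


def usable_threshold(result):
    """Return the threshold, or None when target_fpr was unsatisfiable."""
    if np.isinf(result.threshold):
        return None
    return result.threshold


sentinel = RecallAtFprResult(threshold=np.inf, recall=0.0, actual_fpr=0.0, fp=0, tn=5)
normal = RecallAtFprResult(threshold=0.8, recall=0.6, actual_fpr=0.0, fp=0, tn=5)
print(usable_threshold(sentinel), usable_threshold(normal))
```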

2. R8-C4a: Generator-rng parallel stability in `_score_all_slices`

harness.evaluate(..., rng=np.random.default_rng(N), n_jobs=2) now produces bit-identical results to n_jobs=1. Pre-v0.51, the same rng object was attached to every (slice, scorer) work unit and joblib forked copies at the SAME generator state, so all workers used identical bootstrap sample streams. The result was silent non-independence across pairs and divergence from sequential mode. v0.51 spawns one independent SeedSequence per work unit at the dispatch boundary.

Integer-rng callers (the common case) are unaffected. Callers passing Generator instances now get reproducible results across n_jobs per the SPEC 7 contract.
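The failure mode is easy to reproduce outside the harness. The sketch below uses a pickle round trip to imitate a worker receiving a copy of the generator, then contrasts it with one spawned SeedSequence per work unit. It is an illustration with plain NumPy, not the harness code, and seeding the parent SeedSequence directly with 42 is a simplification.

```python
import pickle

import numpy as np

n_units = 3
rng = np.random.default_rng(42)

# Pre-v0.51 pattern: every work unit gets a copy at the same generator state.
copies = [pickle.loads(pickle.dumps(rng)) for _ in range(n_units)]
copied_draws = [g.integers(0, 1000, size=5).tolist() for g in copies]

# v0.51 pattern (sketch): one independent SeedSequence per work unit.
children = np.random.SeedSequence(42).spawn(n_units)
spawned_draws = [
    np.random.default_rng(child).integers(0, 1000, size=5).tolist()
    for child in children
]

print(copied_draws)   # three identical streams
print(spawned_draws)  # three distinct streams
```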

3. R8-C4b: `spawn_seed_sequences` respects Generator state

_rng.spawn_seed_sequences(rng, n) now draws fresh entropy from the generator via rng.integers(0, 2**63 - 1, size=n) and wraps each value in a SeedSequence. Each call advances the generator state, so repeated calls on the same instance yield different children. Pre-v0.51, the function extracted bit_generator.seed_seq and called .spawn(n), so any advancement of the Generator was ignored.

Most callers (those passing fresh generators) see no change. Callers who advanced the rng before passing it now get the semantically correct, different children.
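The described behavior can be sketched in a few lines. The function below is named spawn_from_generator to make clear it is an illustration of the approach, not the library's spawn_seed_sequences implementation.

```python
import numpy as np


def spawn_from_generator(rng, n):
    """Sketch: draw n integers from rng and wrap each in a SeedSequence."""
    draws = rng.integers(0, 2**63 - 1, size=n)
    return [np.random.SeedSequence(int(d)) for d in draws]


rng = np.random.default_rng(7)
first = [s.entropy for s in spawn_from_generator(rng, 3)]
second = [s.entropy for s in spawn_from_generator(rng, 3)]  # rng has advanced

# A fresh generator with the same seed reproduces the first batch.
fresh = [s.entropy for s in spawn_from_generator(np.random.default_rng(7), 3)]

print(first != second, first == fresh)
```

Because the children depend on the generator's current state, two calls on the same instance differ, while a fresh generator with the same seed stays reproducible.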

4. R8-C2: `SourceDisjointKFoldSplitter` k-cap

iter_folds(...) now caps the fold count at min(self.k, n_sources) (matching get_n_splits(...)). Pre-v0.51 the loop ran range(self.k) and yielded EMPTY test partitions when k > n_sources while get_n_splits returned min(k, n_sources) — the two methods silently disagreed. A UserWarning is emitted when the cap fires.

Callers that consumed surplus empty-test folds (which was the bug) will see fewer iterations now.
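The capping rule itself is small. The sketch below uses a hypothetical helper, effective_fold_count, to show the min(k, n_sources) cap and the UserWarning; the warning text is invented for illustration and the splitter's real message may differ.

```python
import warnings


def effective_fold_count(k, n_sources):
    """Hypothetical helper mirroring the cap described above."""
    n_folds = min(k, n_sources)
    if n_folds < k:
        warnings.warn(
            f"k={k} exceeds n_sources={n_sources}; using {n_folds} folds",
            UserWarning,
            stacklevel=2,
        )
    return n_folds


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    capped = effective_fold_count(k=5, n_sources=3)    # cap fires, warns
    uncapped = effective_fold_count(k=2, n_sources=3)  # no cap, no warning

print(capped, uncapped, len(caught))
```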

What’s Added at v0.51 (additive)

R8-C1: `reseed_splitter` callback on `evaluate_folded`

from dataclasses import replace
evaluate_folded(
    scorers, splitter, slice_,
    seeds=(1, 2, 3),
    reseed_splitter=lambda sp, s: replace(sp, seed=s),
    ...
)

The default, None, preserves the historical replay-folds behavior and emits a DeprecationWarning whenever len(seeds) > 1. Note: the warning persists past v1.0 because the pre-v1.0 deprecation window spans only one minor release, and DEPRECATION.md requires at least 2 to close a cycle.
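The replace-based callback in the example above relies on dataclasses.replace, which returns a modified copy and leaves the original untouched. A minimal sketch with a hypothetical ToySplitter (a stand-in, not a library class) shows what the callback produces for each seed:

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToySplitter:
    """Hypothetical stand-in for a splitter that has a seed field."""
    k: int
    seed: int


def reseed(sp, s):
    """Same shape as the reseed_splitter callback: (splitter, seed) -> splitter."""
    return replace(sp, seed=s)


base = ToySplitter(k=5, seed=0)
reseeded = [reseed(base, s) for s in (1, 2, 3)]

print([sp.seed for sp in reseeded], base.seed)
```

Note that dataclasses.replace only works on dataclass instances; a splitter of another kind would need its own copy-with-new-seed callback.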

R8-C6 / F1 / F2 / F3: validation rigor

Additive checks that fail earlier and with better diagnostics:

calibration.reliability_curve and maximum_calibration_error both validate y_score ∈ [0, 1] (matches metrics.py-side ECE rigor).
calibration.fit_temperature validates the bounds tuple (finite + positive + lo < hi).
losses.RecallAtLowFPR validates pos_weight > 0 at construction.
metric_specs.ece(n_bins=) validates n_bins eagerly at spec-construction (matches eager strategy validation).
analysis.CsvPredictionReader detects missing CSV columns at read time → actionable ValueError instead of cryptic dtype error downstream.
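Two of the checks listed above can be sketched as standalone functions. The names check_scores and check_bounds and the error messages are hypothetical; they illustrate the kind of eager validation described, not the library's code.

```python
import math

import numpy as np


def check_scores(y_score):
    """Reject scores outside [0, 1]; NaN also fails the comparison."""
    arr = np.asarray(y_score, dtype=float)
    if not np.all((arr >= 0.0) & (arr <= 1.0)):
        raise ValueError("y_score must lie in [0, 1]")
    return arr


def check_bounds(bounds):
    """Require a (lo, hi) tuple that is finite, positive, and ordered."""
    lo, hi = bounds
    if not (math.isfinite(lo) and math.isfinite(hi)):
        raise ValueError(f"bounds must be finite, got {bounds!r}")
    if lo <= 0 or hi <= 0:
        raise ValueError(f"bounds must be positive, got {bounds!r}")
    if not lo < hi:
        raise ValueError(f"bounds must satisfy lo < hi, got {bounds!r}")
    return lo, hi


print(check_bounds((0.1, 10.0)))
print(check_scores([0.0, 0.5, 1.0]))
```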

What’s Fixed at v0.51 (docs / structure)

R8-C5: README links repointed from docs/ (broken) to docs/source/ (correct). Migration toctree extended with v0.49 + v0.50 + v0.51 entries.
R8-C8: SimilarityStrategy demoted from “Tier-2 strict” in README + extending.md to “pre-v0.7 internal interface” matching the 9-strict list in docs/source/api/strict_tier2_protocols.md.
R8-G1: repo-strategy.md gains a supersession note pointing at ADR 0001 (flat-module-layout).
R8-C9: claims.GateResult.to_dict docstring notes the JSON-safety contract — strict-JSON requires artifacts.write_json_strict or explicit sanitize_for_json call.
R8-C10: .gitignore audit-artifact patterns extended to cover codex-comprehensive-audit-*, audit-verification-*, and the per-LLM report aliases.
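The JSON-safety note in R8-C9 matters because Python's json module emits non-standard tokens for non-finite floats by default. The sketch below uses only the standard library; to_strict is a hypothetical sanitizer written for this illustration and is not the project's sanitize_for_json, whose behavior may differ.

```python
import json
import math

payload = {"metric": "ece", "value": float("nan")}

# Default json.dumps writes the bare token NaN, which strict parsers reject.
lenient = json.dumps(payload)


def to_strict(obj):
    """Hypothetical sanitizer: replace non-finite floats with None, recursively."""
    if isinstance(obj, float) and not math.isfinite(obj):
        return None
    if isinstance(obj, dict):
        return {k: to_strict(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_strict(v) for v in obj]
    return obj


# allow_nan=False makes json.dumps raise ValueError on any remaining NaN or inf.
strict = json.dumps(to_strict(payload), allow_nan=False)

print(lenient)
print(strict)
```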

Deferred to v1.x

R8-G3 (custom exception hierarchy beyond ValueError): additive Tier-2 change; deferred until a downstream consumer requests it.
R8-G4 (memory-aware n_jobs capping in joblib path): the hazard is documented at _parallel.py:55-59 but mitigation needs RAM-measurement + dataframe-size accounting — non-trivial. Deferred; caller is responsible for n_jobs sizing under their RAM budget.

Notes

The 6 pre-existing test_bootstrap_calibration_mc.py failures on origin/main stem from a v0.50 SPEC-7 migration gap in the _generate_population helper and are unrelated to the v0.51 audit batch. They will be picked up in Round 9 prep.
The Round 9 multi-LLM cross-review runs against the v0.51 RC, following the briefing in gate3-audit-round-9.md (to be authored at release-prep time).