# Migrating to v0.51

The v0.51 release is the **Round 8 audit rectification batch**: all 18 findings from the multi-LLM cross-review of v0.50.0 were verified, and the resulting fixes land in one BREAKING-allowed minor before v1.0 tags.

The audit verification report `audit-verification-codex-gemini-v0.50.0.md` (a repo-root, **gitignored** local record; not committed, so no link) confirmed 13 of 18 claims, refuted 3 (R8-G2, R8-G5, R8-V1+R8-V2 paired), and deferred 2 (R8-G3, R8-G4) to v1.x as Tier-2 additive. v0.51 ships fixes for the 13 confirmed claims; the deferred items are filed in the `audit_findings.md` Round 8 ledger.

Round 9 multi-LLM cross-review verifies the v0.51 RC before v1.0 tags. If Round 9 lands clean, v1.0 follows v0.51 directly.

If you're jumping from v0.49 (or earlier) and have not migrated through v0.50, read `migration/v0.50.md` first.

## What's BREAKING at v0.51

### 1. R8-C3: `recall_at_fpr` fallback sentinel

When no threshold satisfies `target_fpr`, the fallback now returns a sentinel `RecallAtFprResult(threshold=np.inf, recall=0.0, actual_fpr=0.0, fp=0, tn=n_val_neg)`. Pre-v0.51 the fallback returned `threshold=1.0` and then computed `y_pred = (y_score >= 1.0)`, silently classifying any negative-class sample with score 1.0 as positive, so `actual_fpr` could exceed the function's target FPR ceiling (it is 1.0 in the example below).

**Before v0.51 (buggy):**

```text
recall_at_fpr(y=[0,1], scores=[1.0,1.0], target_fpr=0.0)
# returned actual_fpr=1.0, fp=1 (violates ceiling)
```

**v0.51 (correct):**

```text
recall_at_fpr(y=[0,1], scores=[1.0,1.0], target_fpr=0.0)
# returns RecallAtFprResult(threshold=np.inf, actual_fpr=0.0, fp=0, ...)
```

Migration: any caller filtering on `result.threshold` should add an `np.isinf(result.threshold)` branch. Pre-v0.51 the unsatisfiable signal was `threshold=1.0`.

### 2. R8-C4a: Generator-rng parallel stability in `_score_all_slices`

`harness.evaluate(..., rng=np.random.default_rng(N), n_jobs=2)` now produces bit-identical results to `n_jobs=1`.
Pre-v0.51 the same rng object was attached to every `(slice, scorer)` work unit and joblib forked copies at the same generator state, so all workers used identical bootstrap sample streams. The result was silent non-independence across pairs and divergence from sequential mode. v0.51 spawns one independent `SeedSequence` per work unit at the dispatch boundary.

Integer-`rng` callers (the common case) are unaffected. Callers passing `Generator` instances now get reproducible results across `n_jobs`, per the SPEC 7 contract.

### 3. R8-C4b: `spawn_seed_sequences` respects Generator state

`_rng.spawn_seed_sequences(rng, n)` now draws fresh entropy from the generator via `rng.integers(0, 2**63 - 1, size=n)` and wraps each value in a `SeedSequence`. Each call advances generator state, so repeated calls on the same instance yield different children. Pre-v0.51 the function extracted `bit_generator.seed_seq` and called `.spawn(n)`, so any prior advancement of the generator was ignored.

Most callers (passing fresh generators) see no change. Callers who advanced the rng before passing it now get the semantically correct, different children.

### 4. R8-C2: `SourceDisjointKFoldSplitter` k-cap

`iter_folds(...)` now caps the fold count at `min(self.k, n_sources)`, matching `get_n_splits(...)`. Pre-v0.51 the loop ran `range(self.k)` and yielded empty test partitions when `k > n_sources`, while `get_n_splits` returned `min(k, n_sources)`: the two methods silently disagreed. A `UserWarning` is emitted when the cap fires.

Callers that consumed the surplus empty-test folds (which was the bug) will now see fewer iterations.

## What's Added at v0.51 (additive)

### R8-C1: `reseed_splitter` callback on `evaluate_folded`

```text
from dataclasses import replace

evaluate_folded(
    scorers, splitter, slice_,
    seeds=(1, 2, 3),
    reseed_splitter=lambda sp, s: replace(sp, seed=s),
    ...
)
```

Default `None` preserves the historical replay-folds behavior and emits a `DeprecationWarning` whenever `len(seeds) > 1`.
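
To make the callback's effect concrete, here is a minimal sketch of what a `reseed_splitter` of the form shown above does for each seed. It assumes the splitter is a dataclass with a `seed` field, as the `replace(sp, seed=s)` call implies. `ToySplitter` and `reseed` are hypothetical names used only for illustration, not library classes or functions.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToySplitter:
    """Hypothetical stand-in for a seeded splitter; not a library class."""
    k: int
    seed: int


def reseed(sp, s):
    # Same shape as the reseed_splitter lambda: return a copy with a new seed.
    return replace(sp, seed=s)


base = ToySplitter(k=5, seed=0)
per_seed = [reseed(base, s) for s in (1, 2, 3)]

# Each seed gets its own splitter copy; the original is left untouched.
print([sp.seed for sp in per_seed])  # [1, 2, 3]
print(base.seed)                     # 0
```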
Note: the `DeprecationWarning` persists past v1.0 because the pre-v1.0 deprecation window is one minor and `DEPRECATION.md` requires at least two to close a cycle.

### R8-C6 / F1 / F2 / F3: validation rigor

Additive changes that fail earlier, with better diagnostics:

- `calibration.reliability_curve` and `maximum_calibration_error` validate `y_score ∈ [0, 1]` (matching the ECE validation in `metrics.py`).
- `calibration.fit_temperature` validates the `bounds` tuple (finite, positive, and `lo < hi`).
- `losses.RecallAtLowFPR` validates `pos_weight > 0` at construction.
- `metric_specs.ece(n_bins=)` validates `n_bins` eagerly at spec construction (matching the eager `strategy` validation).
- `analysis.CsvPredictionReader` detects missing CSV columns at read time and raises an actionable `ValueError` instead of a cryptic dtype error downstream.

## What's Fixed at v0.51 (docs / structure)

- **R8-C5**: README links repointed from `docs/` (broken) to `docs/source/` (correct). Migration toctree extended with v0.49, v0.50, and v0.51 entries.
- **R8-C8**: `SimilarityStrategy` demoted from "Tier-2 strict" in the README and `extending.md` to "pre-v0.7 internal interface", matching the 9-strict list in `docs/source/api/strict_tier2_protocols.md`.
- **R8-G1**: `repo-strategy.md` gains a supersession note pointing at ADR 0001 (flat-module-layout).
- **R8-C9**: the `claims.GateResult.to_dict` docstring now notes the JSON-safety contract: strict JSON requires `artifacts.write_json_strict` or an explicit `sanitize_for_json` call.
- **R8-C10**: `.gitignore` audit-artifact patterns extended to cover `codex-comprehensive-audit-*`, `audit-verification-*`, and the per-LLM report aliases.

## Deferred to v1.x

- **R8-G3** (custom exception hierarchy beyond `ValueError`): additive Tier-2 change; deferred until a downstream consumer requests it.
- **R8-G4** (memory-aware `n_jobs` capping in the joblib path): the hazard is documented at `_parallel.py:55-59`, but mitigation needs RAM measurement and dataframe-size accounting, which is non-trivial.
  Deferred; the caller is responsible for sizing `n_jobs` to fit their RAM budget.

## Notes

- The pre-existing `test_bootstrap_calibration_mc.py` failures (6 tests) on `origin/main` are a v0.50 SPEC-7 migration gap in the `_generate_population` helper, unrelated to the v0.51 audit batch. They will be picked up in Round 9 prep.
- Round 9 multi-LLM cross-review runs against the v0.51 RC at `gate3-audit-round-9.md` (briefing to be authored at release-prep time).
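
For the R8-C3 migration step in the BREAKING list, a caller-side guard might look like the sketch below. It is an illustration under stated assumptions, not the library's code: `ToyResult` is a hypothetical stand-in that mirrors the `RecallAtFprResult` field names given in this guide, and `usable_threshold` is a hypothetical helper. The sketch uses the standard library's `math.isinf`; since `np.inf` is an ordinary Python float, `np.isinf(result.threshold)` would behave the same way on a scalar threshold.

```python
import math
from typing import NamedTuple


class ToyResult(NamedTuple):
    """Hypothetical stand-in mirroring the RecallAtFprResult fields."""
    threshold: float
    recall: float
    actual_fpr: float
    fp: int
    tn: int


def usable_threshold(result, default=None):
    """Return the threshold, or `default` when target_fpr was unsatisfiable."""
    # v0.51 signals "no threshold satisfies target_fpr" with threshold=inf.
    # Pre-v0.51 the same condition surfaced as threshold=1.0.
    if math.isinf(result.threshold):
        return default
    return result.threshold


sentinel = ToyResult(threshold=math.inf, recall=0.0, actual_fpr=0.0, fp=0, tn=1)
ok = ToyResult(threshold=0.7, recall=0.9, actual_fpr=0.01, fp=1, tn=99)

print(usable_threshold(sentinel))  # None
print(usable_threshold(ok))        # 0.7
```

Code that previously compared `result.threshold` against `1.0` to detect the unsatisfiable case would need to switch to a check of this kind, because `1.0` can also be a legitimate threshold.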