Migrating to v0.51
The v0.51 release is the Round 8 audit rectification batch: all 18
findings from the multi-LLM cross-review of v0.50.0 were verified, and
the resulting fixes land in one BREAKING-allowed minor before v1.0 tags.
The audit verification report audit-verification-codex-gemini-v0.50.0.md
(a repo-root, gitignored local record; not committed, so no link)
confirmed 13 of 18 claims, refuted 3 (R8-G2, R8-G5, and the paired
R8-V1+R8-V2), and deferred 2 (R8-G3, R8-G4) to v1.x as
Tier-2 additive. v0.51 ships fixes for the 13 confirmed claims;
the deferred items are filed in the Round 8 ledger of audit_findings.md.
Round 9 multi-LLM cross-review verifies the v0.51 RC before v1.0 tags. If Round 9 lands clean, v1.0 follows v0.51 directly.
If you’re jumping from v0.49 (or earlier) and have not migrated
through v0.50, read migration/v0.50.md first.
What’s BREAKING at v0.51
1. R8-C3: recall_at_fpr fallback sentinel
When no threshold satisfies target_fpr, the fallback now returns
a sentinel RecallAtFprResult(threshold=np.inf, recall=0.0, actual_fpr=0.0, fp=0, tn=n_val_neg). Pre-v0.51 the fallback
returned threshold=1.0 and then computed y_pred = (y_score >= 1.0),
silently classifying any negative-class sample with score 1.0 as
positive. The reported actual_fpr could then exceed the function’s
target FPR ceiling (it was 1.0 in the example below).
Before v0.51 (buggy):
recall_at_fpr(y=[0,1], scores=[1.0,1.0], target_fpr=0.0)
# returned actual_fpr=1.0, fp=1 (violates ceiling)
v0.51 (correct):
recall_at_fpr(y=[0,1], scores=[1.0,1.0], target_fpr=0.0)
# returns RecallAtFprResult(threshold=np.inf, actual_fpr=0.0, fp=0, ...)
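The arithmetic behind the two outcomes can be reproduced with plain NumPy. This sketch does not call recall_at_fpr; confusion_at is a hypothetical helper that applies only the y_score >= threshold rule described above to the example inputs.

```python
import numpy as np

# Example inputs from the before/after snippets: one negative, one positive,
# both scored exactly 1.0.
y_true = np.array([0, 1])
y_score = np.array([1.0, 1.0])


def confusion_at(threshold):
    """Hypothetical helper: return (fp, tn, fpr) for y_pred = y_score >= threshold."""
    y_pred = y_score >= threshold
    negatives = y_true == 0
    fp = int(np.sum(y_pred & negatives))
    tn = int(np.sum(~y_pred & negatives))
    return fp, tn, fp / (fp + tn)


old_fallback = confusion_at(1.0)     # threshold=1.0 still admits score 1.0
new_fallback = confusion_at(np.inf)  # no finite score can reach np.inf

print(old_fallback)  # (1, 0, 1.0)
print(new_fallback)  # (0, 1, 0.0)
```

With threshold=np.inf no sample is predicted positive, so the target_fpr=0.0 ceiling holds by construction.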
Migration: any caller filtering on result.threshold should add an
np.isinf(result.threshold) branch — pre-v0.51 the unsatisfiable
signal was threshold=1.0.
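A minimal sketch of such a branch follows. The RecallAtFprResult class here is a local stand-in that only mirrors the field names mentioned in this guide, and usable_threshold is a hypothetical caller-side helper, not part of the library.

```python
from typing import NamedTuple, Optional

import numpy as np


class RecallAtFprResult(NamedTuple):
    """Stand-in for the library result; field names taken from this guide."""
    threshold: float
    recall: float
    actual_fpr: float
    fp: int
    tn: int


def usable_threshold(result: RecallAtFprResult) -> Optional[float]:
    """Hypothetical helper: None when no threshold satisfied target_fpr."""
    if np.isinf(result.threshold):
        # v0.51 sentinel: the target was unsatisfiable.
        return None
    return result.threshold


sentinel = RecallAtFprResult(np.inf, 0.0, 0.0, 0, 5)
regular = RecallAtFprResult(0.7, 0.4, 0.01, 1, 99)
print(usable_threshold(sentinel))  # None
print(usable_threshold(regular))   # 0.7
```

Code that previously compared result.threshold against 1.0 to detect the unsatisfiable case would no longer match the sentinel, which is why the explicit np.isinf check is needed.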
2. R8-C4a: Generator-rng parallel stability in _score_all_slices
harness.evaluate(..., rng=np.random.default_rng(N), n_jobs=2) now
produces bit-identical results to n_jobs=1. Pre-v0.51 the same
rng object was attached to every (slice, scorer) work unit and
joblib forked copies at the SAME generator state, so all workers
used identical bootstrap sample streams — silent non-independence
across pairs and divergence vs sequential mode. v0.51 spawns one
independent SeedSequence per work unit at the dispatch boundary.
Integer-rng callers (the common case) are unaffected. Callers
passing Generator instances now get reproducible results across
n_jobs per the SPEC 7 contract.
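The difference between the two dispatch styles can be illustrated with NumPy alone. This is a sketch under assumptions: copy.deepcopy stands in for what a forked worker receives, and the seed 42 and unit count are arbitrary. It does not reproduce the harness code.

```python
import copy

import numpy as np

n_units = 3  # stand-in for the number of (slice, scorer) work units

# Pre-v0.51 style: every unit receives a copy of the generator at the
# same state, so every unit draws the same stream.
rng = np.random.default_rng(42)
copies = [copy.deepcopy(rng) for _ in range(n_units)]
copied_draws = [g.integers(0, 1000, size=5).tolist() for g in copies]

# v0.51 style: one independent SeedSequence child per unit.
children = np.random.SeedSequence(42).spawn(n_units)
spawned_draws = [
    np.random.default_rng(child).integers(0, 1000, size=5).tolist()
    for child in children
]

# Spawning again from the same root reproduces the same children, which
# is what makes results independent of how units are scheduled.
children_again = np.random.SeedSequence(42).spawn(n_units)
spawned_again = [
    np.random.default_rng(child).integers(0, 1000, size=5).tolist()
    for child in children_again
]

print(copied_draws[0] == copied_draws[1] == copied_draws[2])  # True
print(spawned_draws == spawned_again)                          # True
```

Because each unit's stream depends only on its own child SeedSequence, the result does not depend on which worker runs the unit.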
3. R8-C4b: spawn_seed_sequences respects Generator state
_rng.spawn_seed_sequences(rng, n) now draws fresh entropy from
the generator via rng.integers(0, 2**63 - 1, size=n) and wraps
each in a SeedSequence. Each call advances generator state, so
repeated calls on the same instance yield different children.
Pre-v0.51 the function extracted bit_generator.seed_seq and
called .spawn(n) — Generator advancement was ignored.
Most callers (passing fresh generators) see no change. Callers who advanced the rng before passing it now get different children, which is the semantically correct behavior.
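The described behavior can be sketched as follows. spawn_seed_sequences_sketch is a hypothetical re-creation based only on the description above (draw integers from the generator, wrap each in a SeedSequence); the real function may differ in details.

```python
import numpy as np


def spawn_seed_sequences_sketch(rng, n):
    """Hypothetical re-creation: draw n integers from rng, wrap each one."""
    entropy = rng.integers(0, 2**63 - 1, size=n)
    return [np.random.SeedSequence(int(e)) for e in entropy]


rng = np.random.default_rng(7)
first = [s.entropy for s in spawn_seed_sequences_sketch(rng, 3)]
second = [s.entropy for s in spawn_seed_sequences_sketch(rng, 3)]

# A fresh generator with the same seed reproduces the first call.
fresh = [
    s.entropy
    for s in spawn_seed_sequences_sketch(np.random.default_rng(7), 3)
]

print(first != second)  # True: the first call advanced the generator state
print(first == fresh)   # True: fresh generators are unaffected
```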
4. R8-C2: SourceDisjointKFoldSplitter k-cap
iter_folds(...) now caps the fold count at min(self.k, n_sources) (matching get_n_splits(...)). Pre-v0.51 the loop ran
range(self.k) and yielded EMPTY test partitions when
k > n_sources while get_n_splits returned min(k, n_sources) —
the two methods silently disagreed. A UserWarning is emitted when
the cap fires.
Callers that consumed surplus empty-test folds (which was the bug) will see fewer iterations now.
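The capping rule can be illustrated with a toy splitter. iter_source_disjoint_folds below is a hypothetical stand-in, not the library's SourceDisjointKFoldSplitter; it assumes one round-robin assignment of sources to folds purely for illustration.

```python
import warnings
from typing import Iterator, List, Sequence, Tuple


def iter_source_disjoint_folds(
    sources: Sequence[str], k: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Hypothetical toy: yield (train_idx, test_idx) with disjoint sources."""
    unique = sorted(set(sources))
    n_folds = min(k, len(unique))  # the cap described above
    if n_folds < k:
        warnings.warn(
            f"k={k} exceeds n_sources={len(unique)}; using {n_folds} folds",
            UserWarning,
            stacklevel=2,
        )
    for fold in range(n_folds):
        # Round-robin assignment: each fold owns at least one source.
        test_sources = set(unique[fold::n_folds])
        test_idx = [i for i, s in enumerate(sources) if s in test_sources]
        train_idx = [i for i, s in enumerate(sources) if s not in test_sources]
        yield train_idx, test_idx


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    folds = list(iter_source_disjoint_folds(["a", "a", "b", "c", "c", "b"], k=5))

print(len(folds))                               # 3
print(all(len(test) > 0 for _, test in folds))  # True
```

Looping over range(k) instead of range(n_folds) in this toy would yield two folds with empty test partitions, which is the pre-v0.51 symptom.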
What’s Added at v0.51 (additive)
R8-C1: reseed_splitter callback on evaluate_folded
from dataclasses import replace
evaluate_folded(
scorers, splitter, slice_,
seeds=(1, 2, 3),
reseed_splitter=lambda sp, s: replace(sp, seed=s),
...
)
Default None preserves the historical replay-folds behavior and
emits a DeprecationWarning whenever len(seeds) > 1. Note: the
warning persists past v1.0 because the pre-v1.0 deprecation window
is one minor and DEPRECATION.md requires ≥2 minors to close a cycle.
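What the callback in the snippet above does per seed can be seen in isolation. ToySplitter is a hypothetical frozen dataclass with a seed field; real splitters are assumed, not confirmed, to be dataclasses that dataclasses.replace can copy.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToySplitter:
    """Hypothetical splitter with the seed field the callback rewrites."""
    k: int
    seed: int


def reseed(sp, s):
    # Same shape as the lambda passed as reseed_splitter.
    return replace(sp, seed=s)


base = ToySplitter(k=5, seed=0)
reseeded = [reseed(base, s) for s in (1, 2, 3)]

print([sp.seed for sp in reseeded])  # [1, 2, 3]
print(base.seed)                     # 0: replace returns a copy
```

Since replace returns a new instance, the original splitter is left untouched and every other field carries over unchanged.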
R8-C6 / F1 / F2 / F3: validation rigor
Additive checks that fail earlier, with better diagnostics:
- calibration.reliability_curve + maximum_calibration_error validate y_score ∈ [0, 1] (matches metrics.py-side ECE rigor).
- calibration.fit_temperature validates the bounds tuple (finite + positive + lo < hi).
- losses.RecallAtLowFPR validates pos_weight > 0 at construction.
- metric_specs.ece(n_bins=) validates n_bins eagerly at spec-construction (matches eager strategy validation).
- analysis.CsvPredictionReader detects missing CSV columns at read time → actionable ValueError instead of cryptic dtype error downstream.
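The first two checks in the list above follow a common eager-validation pattern, sketched here. check_unit_interval and check_bounds are hypothetical helpers written from the descriptions only; the library's actual messages and internals are not shown in this guide.

```python
import math

import numpy as np


def check_unit_interval(y_score):
    """Hypothetical check: every score must be finite and within [0, 1]."""
    arr = np.asarray(y_score, dtype=float)
    bad = ~np.isfinite(arr) | (arr < 0.0) | (arr > 1.0)
    if bad.any():
        raise ValueError(f"y_score has {int(bad.sum())} value(s) outside [0, 1]")
    return arr


def check_bounds(bounds):
    """Hypothetical check: bounds must be finite, positive, and lo < hi."""
    lo, hi = bounds
    if not (math.isfinite(lo) and math.isfinite(hi)):
        raise ValueError(f"bounds must be finite, got {bounds!r}")
    if lo <= 0 or hi <= 0:
        raise ValueError(f"bounds must be positive, got {bounds!r}")
    if not lo < hi:
        raise ValueError(f"bounds must satisfy lo < hi, got {bounds!r}")
    return float(lo), float(hi)


print(check_bounds((0.05, 20.0)))  # (0.05, 20.0)
try:
    check_unit_interval([0.2, 1.3])
except ValueError as exc:
    print(exc)  # y_score has 1 value(s) outside [0, 1]
```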
What’s Fixed at v0.51 (docs / structure)
- R8-C5: README links repointed from docs/ (broken) to docs/source/ (correct). Migration toctree extended with v0.49 + v0.50 + v0.51 entries.
- R8-C8: SimilarityStrategy demoted from “Tier-2 strict” in README + extending.md to “pre-v0.7 internal interface” matching the 9-strict list in docs/source/api/strict_tier2_protocols.md.
- R8-G1: repo-strategy.md gains a supersession note pointing at ADR 0001 (flat-module-layout).
- R8-C9: claims.GateResult.to_dict docstring notes the JSON-safety contract: strict-JSON requires artifacts.write_json_strict or explicit sanitize_for_json call.
- R8-C10: .gitignore audit-artifact patterns extended to cover codex-comprehensive-audit-*, audit-verification-*, and the per-LLM report aliases.
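The JSON-safety point in R8-C9 matters because Python's standard json module emits non-standard tokens for non-finite floats unless told otherwise. The sketch below uses only the standard library; the payload is made up, and replace_non_finite is a hypothetical sanitizer whose policy (map to None) is an assumption, not necessarily what sanitize_for_json does.

```python
import json
import math

# Made-up payload containing a non-finite float.
payload = {"threshold": float("inf"), "recall": 0.0}

# Default behaviour: succeeds, but "Infinity" is not valid strict JSON.
lenient = json.dumps(payload)

# Strict behaviour: allow_nan=False rejects inf and nan.
try:
    json.dumps(payload, allow_nan=False)
    strict_ok = True
except ValueError:
    strict_ok = False


def replace_non_finite(obj):
    """Hypothetical sanitizer: map non-finite floats to None, recursively."""
    if isinstance(obj, float) and not math.isfinite(obj):
        return None
    if isinstance(obj, dict):
        return {key: replace_non_finite(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [replace_non_finite(value) for value in obj]
    return obj


strict_text = json.dumps(replace_non_finite(payload), allow_nan=False)

print(lenient)      # {"threshold": Infinity, "recall": 0.0}
print(strict_ok)    # False
print(strict_text)  # {"threshold": null, "recall": 0.0}
```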
Deferred to v1.x
- R8-G3 (custom exception hierarchy beyond ValueError): additive Tier-2 change; deferred until a downstream consumer requests it.
- R8-G4 (memory-aware n_jobs capping in joblib path): the hazard is documented at _parallel.py:55-59 but mitigation needs RAM-measurement + dataframe-size accounting, which is non-trivial. Deferred; caller is responsible for n_jobs sizing under their RAM budget.
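Until R8-G4 lands, a caller can size n_jobs by hand. The helper below is a hypothetical caller-side sketch, not library code: it assumes the caller already knows a RAM budget and a rough per-worker footprint, both of which must be estimated for the workload at hand.

```python
GIB = 1024 ** 3


def cap_n_jobs(requested: int, ram_budget_bytes: int, per_worker_bytes: int) -> int:
    """Hypothetical helper: largest worker count that fits the RAM budget."""
    if requested < 1:
        raise ValueError("requested must be >= 1")
    if per_worker_bytes <= 0:
        raise ValueError("per_worker_bytes must be positive")
    affordable = ram_budget_bytes // per_worker_bytes
    # Always allow at least one worker so the job can still run sequentially.
    return max(1, min(requested, affordable))


print(cap_n_jobs(8, 16 * GIB, 3 * GIB))  # 5
print(cap_n_jobs(2, 16 * GIB, 3 * GIB))  # 2
print(cap_n_jobs(4, 1 * GIB, 3 * GIB))   # 1
```

The returned value would then be passed as n_jobs; the per-worker estimate should include the dataframe copies each worker holds.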
Notes
- The pre-existing test_bootstrap_calibration_mc.py failures (6 tests) on origin/main are a v0.50 SPEC-7 migration gap in the _generate_population helper, unrelated to the v0.51 audit batch. Will be picked up in Round 9 prep.
- Round 9 multi-LLM cross-review runs against the v0.51 RC at gate3-audit-round-9.md (briefing to be authored at release-prep time).