Migrating to v0.48
The v0.48 release is the last polish minor before v1.0. It closes the
v1.0 sprint’s “polish + audit-driven tightening” theme: Round 7 audit
follow-on, the v0.46 BootstrapCI.to_dict() schema rewrite, cross-API
shape-validation consistency, and the v0.48 §5E-prep packet-drift fixes
to the methodology docs.
If you’re jumping from v0.46 (or earlier) and have not yet migrated
through v0.47, read migration/v0.47.md first.
What’s BREAKING at v0.48
1. BootstrapCI.to_dict() + PairedBootstrapCI.to_dict() schema rewrite
The pre-v0.48 schema hard-coded a "ci_95" key regardless of the
actual confidence field. At confidence=0.90 the output looked like:
{"point_estimate": 0.5, "ci_95": [0.4, 0.6], "confidence": 0.90, ...}
The "ci_95" key contradicted the "confidence" field. v0.48 names
the bounds neutrally; consumers interpret semantics from the
confidence field.
Before v0.48:
ci.to_dict()
# {"point_estimate": 0.5, "ci_95": [0.4, 0.6], "confidence": 0.95,
# "n_resamples": 1000, "method": "BCa"}
v0.48+:
from eval_toolkit.bootstrap import BootstrapCI
ci = BootstrapCI(
    point_estimate=0.5, ci_low=0.4, ci_high=0.6,
    confidence=0.95, n_resamples=1000, method="BCa",
)
ci.to_dict()
# {"point": 0.5, "low": 0.4, "high": 0.6, "confidence": 0.95,
# "n_resamples": 1000, "method": "BCa"}
Migration: rename point_estimate → point; replace the
ci_95 list-of-two with separate low + high keys.
# Before (illustrative — will fail at v0.48+):
d = ci.to_dict()
p = d["point_estimate"]
lo, hi = d["ci_95"]
# After:
d = ci.to_dict()
p = d["point"]
lo, hi = d["low"], d["high"]
Same rewrite applies to PairedBootstrapCI.to_dict():
# Before: {"delta": 0.1, "ci_95": [0.05, 0.15], "overlaps_zero": False, ...}
# After: {"delta": 0.1, "low": 0.05, "high": 0.15, "overlaps_zero": False, ...}
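Consumers that must read payloads written by both pre-v0.48 and v0.48+ releases (archived JSON results, for example) may want a small normalizer during the transition. The sketch below is an illustration only and not part of eval-toolkit: the helper name normalize_ci_dict is hypothetical, and it assumes nothing beyond the key names shown above.

```python
from typing import Any, Dict, Tuple


def normalize_ci_dict(d: Dict[str, Any]) -> Tuple[float, float, float]:
    """Return (point, low, high) from a pre-v0.48 or v0.48+ style dict.

    Hypothetical helper; relies only on the key names shown in this guide.
    """
    if "low" in d and "high" in d:
        # v0.48+ schema: neutral bound names.
        low, high = d["low"], d["high"]
    elif "ci_95" in d:
        # Pre-v0.48 schema: list-of-two under a hard-coded key.
        low, high = d["ci_95"]
    else:
        raise KeyError("expected 'low'/'high' or 'ci_95' in the payload")

    # "point" replaced "point_estimate"; paired payloads use "delta" in both schemas.
    for key in ("point", "point_estimate", "delta"):
        if key in d:
            return d[key], low, high
    raise KeyError("expected 'point', 'point_estimate', or 'delta' in the payload")


old = {"point_estimate": 0.5, "ci_95": [0.4, 0.6], "confidence": 0.95}
new = {"point": 0.5, "low": 0.4, "high": 0.6, "confidence": 0.95}
paired = {"delta": 0.1, "low": 0.05, "high": 0.15, "overlaps_zero": False}

print(normalize_ci_dict(old))     # (0.5, 0.4, 0.6)
print(normalize_ci_dict(new))     # (0.5, 0.4, 0.6)
print(normalize_ci_dict(paired))  # (0.1, 0.05, 0.15)
```

A shim like this is best treated as temporary: once every stored payload is in the v0.48+ shape, reading d["point"], d["low"], and d["high"] directly is simpler.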
2. sweep() adds strategy_id column + rejects duplicates
The sweep() DataFrame schema grew by one column (strategy_id,
inserted between text_id and variant):
Before v0.48:
columns: text_id, variant, transformed_text[, original_score,
transformed_score, asr]
v0.48+:
columns: text_id, strategy_id, variant, transformed_text[, original_score,
transformed_score, asr]
strategy_id is a canonical per-row identifier built from the
strategy’s configured kwargs (e.g.,
"delimit/delimiter='<<',end='>>'"). It exists so downstream analysis
can disambiguate two configured instances of the same dataclass that
share .name. variant keeps the pre-v0.48 shape for backward-compat
groupby queries.
Callers indexing the DataFrame by column position must re-check offsets. Callers indexing by column name are unaffected.
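To see which positional offsets move, the two column lists above can be compared directly. This is a plain-Python sketch with no eval-toolkit or pandas dependency; the helper name column_shifts is hypothetical, and the lists cover only the always-present columns (the optional scorer columns shift by the same one position).

```python
from typing import Dict, List, Tuple

# Column orders copied from the schemas above (optional scorer columns omitted).
OLD_COLUMNS = ["text_id", "variant", "transformed_text"]
NEW_COLUMNS = ["text_id", "strategy_id", "variant", "transformed_text"]


def column_shifts(old: List[str], new: List[str]) -> Dict[str, Tuple[int, int]]:
    """Map each surviving column name to (old position, new position) if it moved."""
    shifts = {}
    for name in old:
        if name in new and old.index(name) != new.index(name):
            shifts[name] = (old.index(name), new.index(name))
    return shifts


shifts = column_shifts(OLD_COLUMNS, NEW_COLUMNS)
print(shifts)  # {'variant': (1, 2), 'transformed_text': (2, 3)}
```

Position 1 used to hold variant and now holds strategy_id, so positional code does not fail loudly; it reads the wrong column. Name-based access avoids the problem.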
sweep() now rejects two strategies that produce the same
strategy_id:
# Illustrative — this CALL deliberately raises at v0.48+ to surface
# the silent-merge anti-pattern that pre-v0.48 hid:
from eval_toolkit import sweep, DelimitVariant
sweep([DelimitVariant(), DelimitVariant()], ["hello"])
# ValueError: sweep(): duplicate strategy_id "delimit/..." at index 1
# (previously at index 0); each strategy must produce a unique
# strategy_id. If you want two configurations of the same
# dataclass in the same sweep, vary their kwargs so the
# canonical identifier differs.
If you want to sweep over multiple configurations of the same dataclass, vary the kwargs (the canonical pattern — this one executes cleanly):
from eval_toolkit import sweep, DelimitVariant
texts = ["hello", "world"]
df = sweep(
    [DelimitVariant(delimiter="<<"), DelimitVariant(delimiter="[[")],
    texts,
)
# Rows from both configurations survive; strategy_id distinguishes them.
# df.groupby("strategy_id") is the canonical disambiguation pattern.
print(df["strategy_id"].unique().tolist())
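A pre-flight duplicate check can be sketched without calling sweep() at all. Everything in the block below is a stand-in: FakeDelimit is a hypothetical dataclass, and sketch_strategy_id only imitates the "delimit/delimiter='<<',end='>>'" shape quoted above. The identifier the library actually builds may differ in ordering or formatting.

```python
from dataclasses import asdict, dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class FakeDelimit:
    """Hypothetical stand-in for a configured strategy dataclass."""
    delimiter: str = "<<"
    end: str = ">>"
    name: str = "delimit"


def sketch_strategy_id(strategy) -> str:
    """Assumed format '<name>/<k>=<repr(v)>,...'; not the library's own code."""
    kwargs = {k: v for k, v in asdict(strategy).items() if k != "name"}
    body = ",".join(f"{k}={v!r}" for k, v in sorted(kwargs.items()))
    return f"{strategy.name}/{body}"


def find_duplicate_ids(strategies) -> List[Tuple[int, int, str]]:
    """Return (index, first_index, strategy_id) for each repeated identifier."""
    first_seen = {}
    duplicates = []
    for index, strategy in enumerate(strategies):
        sid = sketch_strategy_id(strategy)
        if sid in first_seen:
            duplicates.append((index, first_seen[sid], sid))
        else:
            first_seen[sid] = index
    return duplicates


print(find_duplicate_ids([FakeDelimit(), FakeDelimit()]))
# [(1, 0, "delimit/delimiter='<<',end='>>'")]
print(find_duplicate_ids([FakeDelimit(delimiter="<<"), FakeDelimit(delimiter="[[")]))
# []
```

The first call mirrors the rejected sweep (same kwargs, same identifier); the second mirrors the accepted one, where varying a kwarg changes the identifier.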
3. sweep() validates scorer output shape
A Scorer that returns a wrong-shape array now raises an API-level
ValueError at the sweep() boundary:
# Illustrative — this CALL deliberately raises at v0.48+:
import numpy as np
from eval_toolkit import sweep, DelimitVariant
class _BadScorer:
    def predict_proba(self, X):
        return np.array([0.5] * (len(X) + 1))  # one too many scores
# v0.48 raises immediately at the sweep boundary:
sweep([DelimitVariant()], ["a", "b"], scorer=_BadScorer(), attack_threshold=0.5)
# ValueError: sweep(): scorer.predict_proba(original-texts batch) returned
# shape (3,); expected (2,). The Scorer Protocol requires one
# float P(positive) per input row...
Pre-v0.48, the same mistakes surfaced as silent truncation (overlong
output), an IndexError (short output), or a TypeError (matrix-shaped
output): low-level failures that didn’t identify the offending scorer
call.
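Scorer adapters can be checked ahead of the upgrade with a boundary check of the same flavor. The sketch below is not the library's validation code: check_scorer_output is a hypothetical helper, it treats the scorer output as a plain Python sequence rather than a numpy array, and its error wording is illustrative.

```python
from typing import Sequence


def check_scorer_output(scores: Sequence[float], n_inputs: int, context: str) -> None:
    """Raise a contextual ValueError unless scores has one number per input row."""
    try:
        n_scores = len(scores)
    except TypeError:
        raise ValueError(
            f"{context}: scorer returned a scalar; expected {n_inputs} scores"
        ) from None
    if n_scores != n_inputs:
        raise ValueError(
            f"{context}: scorer returned {n_scores} scores; expected {n_inputs}"
        )
    for row, score in enumerate(scores):
        # A nested list here usually means a matrix-shaped (n, k) output.
        if not isinstance(score, (int, float)):
            raise ValueError(
                f"{context}: score at row {row} is {type(score).__name__}; "
                "expected one float per input row"
            )


class OverlongScorer:
    """Hypothetical scorer that returns one score too many."""

    def predict_proba(self, X):
        return [0.5] * (len(X) + 1)


texts = ["a", "b"]
try:
    check_scorer_output(OverlongScorer().predict_proba(texts), len(texts), "original-texts batch")
except ValueError as exc:
    message = str(exc)
    print(message)  # original-texts batch: scorer returned 3 scores; expected 2
```

For numpy-returning scorers, comparing the array's shape against (len(X),) would be the closer analogue of the check described above.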
What’s added at v0.48 (additive; no migration needed)
- make pre-push target: local-dev gate that mirrors CI’s 3 doc-execution surfaces (Sybil + MyST-NB + --doctest-modules). The Sub-PR-7 incident postmortem (feedback_sybil_python_blocks) motivates this: pytest tests/ silently overrides testpaths and drops 159 Sybil items from collection. make pre-push runs without the positional path arg so all three surfaces stay covered.
- nb_execution_raise_on_error = True in docs/source/conf.py: docs CI now fails on notebook execution errors instead of leaving them as advisory warnings (Decision R7-A; closes R7-F1).
- .doctest-modules expanded from 11 → 21 modules. make test + CI now catch future drift in 10 additional modules’ in-source docstring examples.
- ADR 0001 (flat-module layout, finalized) + ADR 0003 (stability contract + Gate 3 methodology, finalized).
- Standardized ImportError messages across all lazy-extras surfaces. Every ImportError raise now follows the canonical template: "<feature> requires <pkg>. Install with: pip install eval-toolkit[<extra>]".
- Cross-API shape-validation consistency: metrics_at_threshold, paired_bootstrap_op_point_diff, bootstrap_metric_from_predictions, the metrics.py scalars, and the fit_*_binary calibrator family all now validate input shape at their API boundaries with contextual ValueErrors (no low-level numpy errors leaking).
- paired_bootstrap_op_point_diff defensive guard: passing the same array for val_y + test_y raises ValueError (Round 5 R5-F6e finding; the two-level bootstrap assumes disjoint partitions).
- Documentation polish: SynonymSubstitution whitelist Notes section; Scorecard.to_pandas() dtype coercion Notes; CostSensitiveSelector calibrated-prior warning; Round 5 packet-drift fixes across 7 methodology pages.
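The canonical ImportError template is straightforward to reproduce in downstream code that lazily imports its own optional dependencies. The following is a minimal sketch under stated assumptions: the helper name require is hypothetical and is not an eval-toolkit API, and the package and extra names in the usage lines are placeholders.

```python
import importlib
from types import ModuleType


def require(pkg: str, feature: str, extra: str) -> ModuleType:
    """Import pkg lazily, re-raising ImportError in the canonical template."""
    try:
        return importlib.import_module(pkg)
    except ImportError as exc:
        raise ImportError(
            f"{feature} requires {pkg}. "
            f"Install with: pip install eval-toolkit[{extra}]"
        ) from exc


# An importable stdlib module passes straight through.
json_module = require("json", "JSON export", "io")

# A placeholder package name that should not exist triggers the template.
try:
    require("hypothetical_missing_pkg_xyz", "Plotting", "plots")
except ImportError as exc:
    template_message = str(exc)
    print(template_message)
```

Chaining with "from exc" keeps the original import failure in the traceback while the top-level message stays actionable.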
Migration checklist
Before bumping the pin to eval-toolkit==0.48.0:
- Replace d["point_estimate"] → d["point"] wherever you consume BootstrapCI.to_dict() output, and replace d["ci_95"] → (d["low"], d["high"]) wherever you consume BootstrapCI.to_dict() or PairedBootstrapCI.to_dict() output.
- Audit sweep() callsites for column-position indexing: the DataFrame now has 4 columns before the optional scorer columns (was 3). Switch to column-name indexing if you weren’t already.
- Audit sweep() callsites for intentional duplicate-instance sweeps. If you pass the same configured strategy twice, either remove the duplicate or vary the kwargs.
- If any of your Scorer adapters return wrong-shape arrays (especially silently overlong ones), fix them: sweep() now refuses to silently truncate.
- Run your test suite against the new pin; the v0.47→v0.48 transition surfaces every stale callsite as a ValueError or KeyError at runtime.
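The first checklist item can be partly automated with a text scan for the removed dictionary keys. The sketch below is a rough aid rather than a complete audit: find_stale_keys is a hypothetical helper, and it only matches literal subscript access such as d["ci_95"], not .get() calls or keys held in variables.

```python
import re
from typing import List, Tuple

# Matches ["point_estimate"] or ['ci_95'] style subscripts.
STALE_KEY = re.compile(r"""\[\s*["'](point_estimate|ci_95)["']\s*\]""")


def find_stale_keys(source: str) -> List[Tuple[int, str]]:
    """Return (line number, key) for each pre-v0.48 key subscript in source."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in STALE_KEY.finditer(line):
            hits.append((lineno, match.group(1)))
    return hits


sample = '''d = ci.to_dict()
p = d["point_estimate"]
lo, hi = d['ci_95']
ok = d["point"]
'''

print(find_stale_keys(sample))  # [(2, 'point_estimate'), (3, 'ci_95')]
```

Hits still need manual review, since an unrelated dict may legitimately use the same key names.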
What’s next (v1.0 stability commitment)
Once v0.48 has shipped and been through at least one consumer cycle,
the Round 8 audit STOP-GATE (Decision Y.2) opens. A final Codex +
Gemini pass runs against the complete pre-v1.0 packet, followed by
v1.0.0:
- No new code at v1.0: content-identical to v0.48 modulo the version bump + roadmap edits + ADR finalization confirmation.
- All 4 v1.0 gates closed: Gate 1 (consumer cycle), Gate 2 (Protocol stability), Gate 3 (multi-model cross-review), Gate 4 (Croissant e2e, already MET at v0.41).
See the v1.0 sprint plan at
~/.claude/plans/evaluate-all-the-work-twinkly-kite.md for the full
release sequence + locked decisions A–Z + R6-A through R6-H + R7-A
through R7-C.