Migrating to v0.48

The v0.48 release is the last polish minor before v1.0. It closes the v1.0 sprint’s “polish + audit-driven tightening” theme: Round 7 audit follow-on, the v0.46 BootstrapCI.to_dict() schema rewrite, cross-API shape-validation consistency, and the v0.48 §5E-prep packet-drift fixes to the methodology docs.

If you’re jumping from v0.46 (or earlier) and have not yet migrated through v0.47, read migration/v0.47.md first.

What’s BREAKING at v0.48

1. `BootstrapCI.to_dict()` + `PairedBootstrapCI.to_dict()` schema rewrite

The pre-v0.48 schema hard-coded a "ci_95" key regardless of the actual confidence field. At confidence=0.90 the output looked like:

{"point_estimate": 0.5, "ci_95": [0.4, 0.6], "confidence": 0.90, ...}

The "ci_95" key contradicted the "confidence" field. v0.48 names the bounds neutrally; consumers interpret semantics from the confidence field.

Before v0.48:

ci.to_dict()
# {"point_estimate": 0.5, "ci_95": [0.4, 0.6], "confidence": 0.95,
#  "n_resamples": 1000, "method": "BCa"}

v0.48+:

from eval_toolkit.bootstrap import BootstrapCI

ci = BootstrapCI(
    point_estimate=0.5, ci_low=0.4, ci_high=0.6,
    confidence=0.95, n_resamples=1000, method="BCa",
)
ci.to_dict()
# {"point": 0.5, "low": 0.4, "high": 0.6, "confidence": 0.95,
#  "n_resamples": 1000, "method": "BCa"}

Migration: rename point_estimate → point; replace the ci_95 list-of-two with separate low + high keys.

# Before (illustrative — will fail at v0.48+):
d = ci.to_dict()
p = d["point_estimate"]
lo, hi = d["ci_95"]

# After:
d = ci.to_dict()
p = d["point"]
lo, hi = d["low"], d["high"]

Same rewrite applies to PairedBootstrapCI.to_dict():

# Before: {"delta": 0.1, "ci_95": [0.05, 0.15], "overlaps_zero": False, ...}
# After:  {"delta": 0.1, "low": 0.05, "high": 0.15, "overlaps_zero": False, ...}
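
Consumers that must read both pre-v0.48 and v0.48+ payloads during a transition (for example, archived JSON results) could normalize either shape with a small adapter. The following is a minimal sketch, not part of eval_toolkit: `normalize_ci_dict` is a hypothetical helper name, and it assumes only the key names shown above.

```python
from typing import Any


def normalize_ci_dict(d: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of a to_dict() payload using the v0.48+ key names.

    Hypothetical helper (not part of eval_toolkit). Covers both the
    BootstrapCI shape ("point_estimate" -> "point") and the
    PairedBootstrapCI shape ("delta" is unchanged); in both, the
    "ci_95" pair becomes separate "low" and "high" keys.
    """
    out = dict(d)  # shallow copy so the caller's dict is left untouched
    if "ci_95" in out:
        # Pre-v0.48: bounds stored as a two-element list.
        out["low"], out["high"] = out.pop("ci_95")
    if "point_estimate" in out:
        # Pre-v0.48 BootstrapCI name for the point estimate.
        out["point"] = out.pop("point_estimate")
    return out


old = {"point_estimate": 0.5, "ci_95": [0.4, 0.6], "confidence": 0.90}
new = {"point": 0.5, "low": 0.4, "high": 0.6, "confidence": 0.90}
assert normalize_ci_dict(old) == new
assert normalize_ci_dict(new) == new  # already-migrated payloads pass through
```

An adapter like this is best treated as temporary: once all stored payloads use the v0.48+ keys, reading `d["point"]`, `d["low"]`, and `d["high"]` directly is simpler.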

2. `sweep()` adds `strategy_id` column + rejects duplicates

The sweep() DataFrame schema grew by one column (strategy_id, inserted between text_id and variant):

Before v0.48:

columns: text_id, variant, transformed_text[, original_score,
         transformed_score, asr]

v0.48+:

columns: text_id, strategy_id, variant, transformed_text[, original_score,
         transformed_score, asr]

strategy_id is a canonical per-row identifier built from the strategy’s configured kwargs (e.g., "delimit/delimiter='<<',end='>>'"). It exists so downstream analysis can disambiguate two configured instances of the same dataclass that share .name. variant keeps the pre-v0.48 shape for backward-compat groupby queries.
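
To make the idea of an identifier built from configured kwargs concrete, here is a rough sketch that reproduces the example string above from a dataclass. It is an illustration only: `DelimitSketch` and `sketch_strategy_id` are hypothetical names, and the library's real construction may differ in field ordering, quoting, or handling of defaults.

```python
from dataclasses import dataclass, fields
from typing import ClassVar


@dataclass(frozen=True)
class DelimitSketch:
    """Hypothetical stand-in for a strategy; not eval_toolkit's DelimitVariant."""

    name: ClassVar[str] = "delimit"  # ClassVar, so it is not a dataclass field
    delimiter: str = "<<"
    end: str = ">>"


def sketch_strategy_id(strategy) -> str:
    """Join the strategy name with its field=repr(value) pairs (assumed format)."""
    kwargs = ",".join(
        f"{f.name}={getattr(strategy, f.name)!r}" for f in fields(strategy)
    )
    return f"{strategy.name}/{kwargs}"


print(sketch_strategy_id(DelimitSketch()))
# delimit/delimiter='<<',end='>>'
print(sketch_strategy_id(DelimitSketch(delimiter="[[", end="]]")))
# delimit/delimiter='[[',end=']]'
```

Both instances share the same `name`, so only the kwargs portion tells them apart, which is the disambiguation role the column plays.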

Callers indexing the DataFrame by column position must re-check offsets. Callers indexing by column name are unaffected.
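
The offset change can be checked mechanically. The sketch below uses plain lists of the column names documented above (without the optional scorer columns) rather than a real DataFrame; `shifted_positions` is a hypothetical helper.

```python
# Column orders as documented above, without the optional scorer columns.
OLD_COLUMNS = ["text_id", "variant", "transformed_text"]
NEW_COLUMNS = ["text_id", "strategy_id", "variant", "transformed_text"]


def shifted_positions(old, new):
    """Map each pre-existing column to its (old, new) position if it moved."""
    return {
        name: (old.index(name), new.index(name))
        for name in old
        if old.index(name) != new.index(name)
    }


print(shifted_positions(OLD_COLUMNS, NEW_COLUMNS))
# {'variant': (1, 2), 'transformed_text': (2, 3)}
```

Position 1 used to hold `variant` and now holds `strategy_id`, so positional access to that slot keeps working without error but returns different data. The optional scorer columns shift by one in the same way.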

sweep() now rejects two strategies that produce the same strategy_id:

# Illustrative — this CALL deliberately raises at v0.48+ to surface
# the silent-merge anti-pattern that pre-v0.48 hid:
from eval_toolkit import sweep, DelimitVariant

sweep([DelimitVariant(), DelimitVariant()], ["hello"])
# ValueError: sweep(): duplicate strategy_id "delimit/..." at index 1
#             (previously at index 0); each strategy must produce a unique
#             strategy_id. If you want two configurations of the same
#             dataclass in the same sweep, vary their kwargs so the
#             canonical identifier differs.

If you want to sweep over multiple configurations of the same dataclass, vary the kwargs (the canonical pattern — this one executes cleanly):

from eval_toolkit import sweep, DelimitVariant

texts = ["hello", "world"]
df = sweep(
    [DelimitVariant(delimiter="<<"), DelimitVariant(delimiter="[[")],
    texts,
)
# Both rows survive; strategy_id distinguishes them. df.groupby("strategy_id")
# is the canonical disambiguation pattern.
print(df["strategy_id"].unique().tolist())
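
If strategy lists are assembled programmatically, a pre-flight pass can drop repeated configurations before they reach sweep(). The sketch below relies on dataclass equality and assumes that two instances with equal fields would produce the same strategy_id; `DelimitSketch` and `drop_equal_configs` are hypothetical names, not eval_toolkit API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DelimitSketch:
    """Hypothetical stand-in for a configured strategy dataclass."""

    delimiter: str = "<<"
    end: str = ">>"


def drop_equal_configs(strategies):
    """Keep the first occurrence of each configuration, preserving order."""
    unique = []
    for strategy in strategies:
        # Dataclasses compare equal when all their fields are equal.
        if strategy not in unique:
            unique.append(strategy)
    return unique


requested = [DelimitSketch(), DelimitSketch(), DelimitSketch(delimiter="[[")]
deduplicated = drop_equal_configs(requested)
print(len(requested), "->", len(deduplicated))
# 3 -> 2
```

Dropping duplicates silently is a design choice; raising or logging when `len(deduplicated) != len(requested)` may be preferable if a repeated configuration signals a bug upstream.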

3. `sweep()` validates scorer output shape

A Scorer that returns a wrong-shape array now raises an API-level ValueError at the sweep() boundary:

# Illustrative — this CALL deliberately raises at v0.48+:
import numpy as np
from eval_toolkit import sweep, DelimitVariant

class _BadScorer:
    def predict_proba(self, X):
        return np.array([0.5] * (len(X) + 1))  # one too many scores

# v0.48 raises immediately at the sweep boundary:
sweep([DelimitVariant()], ["a", "b"], scorer=_BadScorer(), attack_threshold=0.5)
# ValueError: sweep(): scorer.predict_proba(original-texts batch) returned
#             shape (3,); expected (2,). The Scorer Protocol requires one
#             float P(positive) per input row...

Pre-v0.48, the same mistake surfaced as silent truncation (overlong), an IndexError (short), or a TypeError (matrix-shaped). None of these low-level numpy behaviors identified the offending scorer call.

What’s added at v0.48 (additive, no migration needed)

make pre-push target: a local-dev gate that mirrors CI’s 3 doc-execution surfaces (Sybil + MyST-NB + --doctest-modules). The Sub-PR-7 incident postmortem (feedback_sybil_python_blocks) motivates this: pytest tests/ silently overrides testpaths and drops 159 Sybil items from collection. make pre-push runs pytest without the positional path arg so all three surfaces stay covered.
nb_execution_raise_on_error = True in docs/source/conf.py — docs CI now fails on notebook execution errors instead of leaving them as advisory warnings (Decision R7-A; closes R7-F1).
.doctest-modules expanded from 11 → 21 modules. make test + CI now catch future drift in 10 additional modules’ in-source docstring examples.
ADR 0001 (flat-module layout, finalized) + ADR 0003 (stability contract + Gate 3 methodology, finalized).
Standardized ImportError messages across all lazy-extras surfaces. Every ImportError raise now follows the canonical template: "<feature> requires <pkg>. Install with: pip install eval-toolkit[<extra>]".
Cross-API shape-validation consistency — metrics_at_threshold, paired_bootstrap_op_point_diff, bootstrap_metric_from_predictions, the metrics.py scalars, and the fit_*_binary calibrator family all now validate input shape at their API boundaries with contextual ValueErrors (no low-level numpy errors leaking).
paired_bootstrap_op_point_diff defensive guard — passing the same array for val_y + test_y raises ValueError (Round 5 R5-F6e finding; the two-level bootstrap assumes disjoint partitions).
Documentation polish: SynonymSubstitution whitelist Notes section; Scorecard.to_pandas() dtype coercion Notes; CostSensitiveSelector calibrated-prior warning; Round 5 packet-drift fixes across 7 methodology pages.
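
The canonical ImportError template listed above can be sketched as a small lazy-import helper. This is an illustration of the message format only: `require_extra` is a hypothetical function, and the module and extra names in the usage are made up for the demonstration.

```python
import importlib


def require_extra(module_name, *, feature, pkg, extra):
    """Import module_name, or raise ImportError using the canonical template.

    Hypothetical helper; eval_toolkit's internal mechanism may differ.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{feature} requires {pkg}. "
            f"Install with: pip install eval-toolkit[{extra}]"
        ) from exc


try:
    # Made-up module and extra names, chosen so the import fails.
    require_extra(
        "no_such_module_for_demo",
        feature="demo_feature()",
        pkg="no-such-package",
        extra="demo",
    )
except ImportError as exc:
    print(exc)
# demo_feature() requires no-such-package. Install with: pip install eval-toolkit[demo]
```

Chaining with `from exc` keeps the original import failure visible in the traceback while the top-level message tells the user which extra to install.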

Migration checklist

Before bumping the pin to eval-toolkit==0.48.0:

Replace d["point_estimate"] → d["point"]; replace d["ci_95"] → (d["low"], d["high"]) everywhere you consume BootstrapCI.to_dict() or PairedBootstrapCI.to_dict() output.
Audit sweep() callsites for column-position indexing — the DataFrame now has 4 columns before the optional scorer columns (was 3). Switch to column-name indexing if you weren’t already.
Audit sweep() callsites for intentional duplicate-instance sweeps. If you pass the same configured strategy twice, either remove the duplicate or vary the kwargs.
If any of your Scorer adapters return wrong-shape arrays (especially silent overlong), fix them — sweep() now refuses to silently truncate.
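Run your test suite against the new pin. After the v0.47→v0.48 transition, unmigrated to_dict() key lookups fail with KeyError, and duplicate-strategy or wrong-shape-scorer sweep() calls fail with ValueError at runtime. Position-based column indexing may not raise at all, so verify those callsites by hand.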
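
For the first checklist item, a quick text scan can locate candidate callsites before the test suite runs. The sketch below looks only for the old keys as quoted strings, so the `point_estimate=` constructor keyword shown earlier is not flagged. `find_legacy_keys` is a hypothetical helper, and a match is a hint to review rather than proof of an unmigrated callsite.

```python
import re

# Old to_dict() keys, matched only when they appear as quoted strings.
LEGACY_KEY = re.compile("[\"'](point_estimate|ci_95)[\"']")


def find_legacy_keys(source: str):
    """Return (line_number, key) pairs for quoted pre-v0.48 keys in source."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in LEGACY_KEY.finditer(line):
            hits.append((lineno, match.group(1)))
    return hits


sample = '''d = ci.to_dict()
p = d["point_estimate"]
lo, hi = d["ci_95"]
ci = BootstrapCI(point_estimate=0.5, ci_low=0.4, ci_high=0.6)
'''
print(find_legacy_keys(sample))
# [(2, 'point_estimate'), (3, 'ci_95')]
```

To scan a project, the same function could be applied to the text of each source file, for example via `pathlib.Path.rglob("*.py")` and `read_text()`.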

What’s next (v1.0 stability commitment)

After v0.48 ships and observes ≥1 consumer cycle, the Round 8 audit STOP-GATE (Decision Y.2) opens. Final Codex + Gemini pass against the complete pre-v1.0 packet, then v1.0.0:

No new code at v1.0 — content-identical to v0.48 modulo the version bump + roadmap edits + ADR finalization confirmation.
All 4 v1.0 gates closed: Gate 1 (consumer cycle), Gate 2 (Protocol stability), Gate 3 (multi-model cross-review), Gate 4 (Croissant e2e — already MET at v0.41).

See the v1.0 sprint plan at ~/.claude/plans/evaluate-all-the-work-twinkly-kite.md for the full release sequence + locked decisions A–Z + R6-A through R6-H + R7-A through R7-C.