# Migrating to v0.48

The v0.48 release is the **last polish minor before v1.0**. It closes the v1.0 sprint's "polish + audit-driven tightening" theme: Round 7 audit follow-on, the v0.46 `BootstrapCI.to_dict()` schema rewrite, cross-API shape-validation consistency, and the v0.48 §5E-prep packet-drift fixes to the methodology docs.

If you're jumping from v0.46 (or earlier) and have not yet migrated through v0.47, read `migration/v0.47.md` first.

## What's BREAKING at v0.48

### 1. `BootstrapCI.to_dict()` + `PairedBootstrapCI.to_dict()` schema rewrite

The pre-v0.48 schema hard-coded a `"ci_95"` key regardless of the actual `confidence` field. At `confidence=0.90` the output looked like:

```text
{"point_estimate": 0.5, "ci_95": [0.4, 0.6], "confidence": 0.90, ...}
```

The `"ci_95"` key contradicted the `"confidence"` field. v0.48 names the bounds neutrally; consumers interpret semantics from the `confidence` field.

**Before v0.48:**

```text
ci.to_dict()
# {"point_estimate": 0.5, "ci_95": [0.4, 0.6], "confidence": 0.95,
#  "n_resamples": 1000, "method": "BCa"}
```

**v0.48+:**

```python
from eval_toolkit.bootstrap import BootstrapCI

ci = BootstrapCI(
    point_estimate=0.5,
    ci_low=0.4,
    ci_high=0.6,
    confidence=0.95,
    n_resamples=1000,
    method="BCa",
)
ci.to_dict()
# {"point": 0.5, "low": 0.4, "high": 0.6, "confidence": 0.95,
#  "n_resamples": 1000, "method": "BCa"}
```

**Migration**: rename `point_estimate` → `point`; replace the `ci_95` list-of-two with separate `low` + `high` keys.

```text
# Before (illustrative; fails at v0.48+):
d = ci.to_dict()
p = d["point_estimate"]
lo, hi = d["ci_95"]

# After:
d = ci.to_dict()
p = d["point"]
lo, hi = d["low"], d["high"]
```

Same rewrite applies to `PairedBootstrapCI.to_dict()`:

```text
# Before:
{"delta": 0.1, "ci_95": [0.05, 0.15], "overlaps_zero": False, ...}

# After:
{"delta": 0.1, "low": 0.05, "high": 0.15, "overlaps_zero": False, ...}
```

### 2. `sweep()` adds `strategy_id` column + rejects duplicates

The `sweep()` DataFrame schema grew by one column (`strategy_id`, inserted between `text_id` and `variant`):

**Before v0.48:**

```text
columns: text_id, variant, transformed_text[, original_score, transformed_score, asr]
```

**v0.48+:**

```text
columns: text_id, strategy_id, variant, transformed_text[, original_score, transformed_score, asr]
```

`strategy_id` is a canonical per-row identifier built from the strategy's configured kwargs (e.g., `"delimit/delimiter='<<',end='>>'"`). It exists so downstream analysis can disambiguate two configured instances of the same dataclass that share `.name`. `variant` keeps the pre-v0.48 shape for backward-compat `groupby` queries.

Callers indexing the DataFrame by column position must re-check offsets. Callers indexing by column name are unaffected.

`sweep()` now **rejects** two strategies that produce the same `strategy_id`:

```text
# Illustrative: this CALL deliberately raises at v0.48+ to surface
# the silent-merge anti-pattern that pre-v0.48 hid:
from eval_toolkit import sweep, DelimitVariant

sweep([DelimitVariant(), DelimitVariant()], ["hello"])
# ValueError: sweep(): duplicate strategy_id "delimit/..." at index 1
#   (previously at index 0); each strategy must produce a unique
#   strategy_id. If you want two configurations of the same
#   dataclass in the same sweep, vary their kwargs so the
#   canonical identifier differs.
```

If you want to sweep over multiple configurations of the same dataclass, vary the kwargs (the canonical pattern; this one executes cleanly):

```python
from eval_toolkit import sweep, DelimitVariant

texts = ["hello", "world"]
df = sweep(
    [DelimitVariant(delimiter="<<"), DelimitVariant(delimiter="[[")],
    texts,
)
# Both strategies' rows survive; strategy_id distinguishes them.
# df.groupby("strategy_id") is the canonical disambiguation pattern.
print(df["strategy_id"].unique().tolist())
```

### 3. `sweep()` validates scorer output shape

A `Scorer` that returns a wrong-shape array now raises an API-level `ValueError` at the `sweep()` boundary:

```text
# Illustrative: this CALL deliberately raises at v0.48+:
import numpy as np
from eval_toolkit import sweep, DelimitVariant

class _BadScorer:
    def predict_proba(self, X):
        return np.array([0.5] * (len(X) + 1))  # one too many scores

# v0.48 raises immediately at the sweep boundary:
sweep([DelimitVariant()], ["a", "b"], scorer=_BadScorer(), attack_threshold=0.5)
# ValueError: sweep(): scorer.predict_proba(original-texts batch) returned
#   shape (3,); expected (2,). The Scorer Protocol requires one
#   float P(positive) per input row...
```

**Pre-v0.48**: silent truncation (overlong), `IndexError` (short), or `TypeError` (matrix-shaped). None of these low-level numpy outcomes identified the offending scorer call.

## What's added at v0.48 (additive, no migration needed)

- **`make pre-push`** target: a local-dev gate that mirrors CI's 3 doc-execution surfaces (Sybil + MyST-NB + `--doctest-modules`). The Sub-PR-7 incident postmortem (`feedback_sybil_python_blocks`) motivates this: `pytest tests/` silently overrides `testpaths` and drops 159 Sybil items from collection. `make pre-push` runs without the positional path arg so all three surfaces stay covered.
- **`nb_execution_raise_on_error = True`** in `docs/source/conf.py`: docs CI now fails on notebook execution errors instead of leaving them as advisory warnings (Decision R7-A; closes R7-F1).
- **`.doctest-modules` expanded** from 11 → 21 modules. `make test` + CI now catch future drift in 10 additional modules' in-source docstring examples.
- **ADR 0001** (flat-module layout, finalized) + **ADR 0003** (stability contract + Gate 3 methodology, finalized).
- **Standardized `ImportError` messages** across all lazy-extras surfaces. Every `ImportError` raise now follows the canonical template: `" requires . Install with: pip install eval-toolkit[]"`.
- **Cross-API shape-validation consistency**: `metrics_at_threshold`, `paired_bootstrap_op_point_diff`, `bootstrap_metric_from_predictions`, the `metrics.py` scalars, and the `fit_*_binary` calibrator family all now validate input shape at their API boundaries with contextual `ValueError`s (no low-level numpy errors leaking).
- **`paired_bootstrap_op_point_diff` defensive guard**: passing the same array for `val_y` + `test_y` raises `ValueError` (Round 5 R5-F6e finding; the two-level bootstrap assumes disjoint partitions).
- **Documentation polish**: `SynonymSubstitution` whitelist `Notes` section; `Scorecard.to_pandas()` dtype coercion `Notes`; `CostSensitiveSelector` calibrated-prior warning; Round 5 packet-drift fixes across 7 methodology pages.

## Migration checklist

Before bumping the pin to `eval-toolkit==0.48.0`:

- [ ] Replace `d["point_estimate"]` → `d["point"]`; replace `d["ci_95"]` → `(d["low"], d["high"])` everywhere you consume `BootstrapCI.to_dict()` or `PairedBootstrapCI.to_dict()` output.
- [ ] Audit `sweep()` callsites for column-position indexing: the DataFrame now has 4 columns before the optional scorer columns (was 3). Switch to column-name indexing if you weren't already.
- [ ] Audit `sweep()` callsites for intentional duplicate-instance sweeps. If you pass the same configured strategy twice, either remove the duplicate or vary the kwargs.
- [ ] If any of your `Scorer` adapters return wrong-shape arrays (especially silent overlong), fix them; `sweep()` now refuses to silently truncate.
- [ ] Run your test suite against the new pin; the v0.47→v0.48 transition surfaces every unmigrated callsite as a `ValueError` or `KeyError` at runtime.

## What's next (v1.0 stability commitment)

After v0.48 ships and has gone through at least one consumer cycle, the Round 8 audit STOP-GATE (Decision Y.2) opens.
A final Codex + Gemini pass runs against the complete pre-v1.0 packet, followed by `v1.0.0`:

- No new code at v1.0: it is content-identical to v0.48 modulo the version bump + roadmap edits + ADR finalization confirmation.
- All 4 v1.0 gates closed: Gate 1 (consumer cycle), Gate 2 (Protocol stability), Gate 3 (multi-model cross-review), Gate 4 (Croissant e2e, already MET at v0.41).

See the v1.0 sprint plan at `~/.claude/plans/evaluate-all-the-work-twinkly-kite.md` for the full release sequence + locked decisions A–Z + R6-A through R6-H + R7-A through R7-C.
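
Returning to the `to_dict()` key rename from the migration checklist: a consumer that has to read payloads serialized on both sides of the upgrade (for example, cached JSON written before the pin bump) might normalize them once at the boundary. The sketch below is one way to do that. `normalize_ci_dict` is a hypothetical helper, not part of `eval_toolkit`, and it assumes the only differences between the two layouts are the `point_estimate` → `point` rename and the `ci_95` → `low` + `high` split described in this guide.

```python
from typing import Any


def normalize_ci_dict(d: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of a to_dict() payload in the v0.48+ key layout.

    Hypothetical helper (not an eval_toolkit API). Accepts either the
    pre-v0.48 or the v0.48+ layout and leaves unrelated keys untouched.
    """
    out = dict(d)  # shallow copy so the caller's dict is not mutated
    # Pre-v0.48 payloads carry "point_estimate"; v0.48+ uses "point".
    if "point_estimate" in out:
        out["point"] = out.pop("point_estimate")
    # Pre-v0.48 payloads carry a two-element "ci_95"; v0.48+ splits it.
    if "ci_95" in out:
        low, high = out.pop("ci_95")
        out["low"], out["high"] = low, high
    return out


old_payload = {"point_estimate": 0.5, "ci_95": [0.4, 0.6],
               "confidence": 0.95, "n_resamples": 1000, "method": "BCa"}
new_payload = normalize_ci_dict(old_payload)
lo, hi = new_payload["low"], new_payload["high"]
```

Because the helper only touches the renamed keys, the same function also covers the `PairedBootstrapCI.to_dict()` layout, where `delta` and `overlaps_zero` pass through unchanged. Payloads already in the v0.48+ layout are returned as an equal copy.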