# Migrating to v0.48

The v0.48 release is the **last polish minor before v1.0**. It closes the
v1.0 sprint's "polish + audit-driven tightening" theme: Round 7 audit
follow-on, the v0.46 `BootstrapCI.to_dict()` schema rewrite, cross-API
shape-validation consistency, and the v0.48 §5E-prep packet-drift fixes
to the methodology docs.

If you're jumping from v0.46 (or earlier) and have not yet migrated
through v0.47, read `migration/v0.47.md` first.

## What's BREAKING at v0.48

### 1. `BootstrapCI.to_dict()` + `PairedBootstrapCI.to_dict()` schema rewrite

The pre-v0.48 schema hard-coded a `"ci_95"` key regardless of the
actual `confidence` field. At `confidence=0.90` the output looked like:

```text
{"point_estimate": 0.5, "ci_95": [0.4, 0.6], "confidence": 0.90, ...}
```

The `"ci_95"` key contradicted the `"confidence"` field. v0.48 names
the bounds neutrally; consumers read the interval's level from the
`confidence` field instead of from the key name.

**Before v0.48:**

```text
ci.to_dict()
# {"point_estimate": 0.5, "ci_95": [0.4, 0.6], "confidence": 0.95,
#  "n_resamples": 1000, "method": "BCa"}
```

**v0.48+:**

```python
from eval_toolkit.bootstrap import BootstrapCI

ci = BootstrapCI(
    point_estimate=0.5, ci_low=0.4, ci_high=0.6,
    confidence=0.95, n_resamples=1000, method="BCa",
)
ci.to_dict()
# {"point": 0.5, "low": 0.4, "high": 0.6, "confidence": 0.95,
#  "n_resamples": 1000, "method": "BCa"}
```

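**Migration**: in code that reads the dict, rename the `point_estimate`
key → `point` and replace the `ci_95` list-of-two with separate `low` +
`high` keys. Only the `to_dict()` output changes here: the `BootstrapCI`
constructor in the v0.48+ example above still takes `point_estimate`.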

```text
# Before (illustrative — will fail at v0.48+):
d = ci.to_dict()
p = d["point_estimate"]
lo, hi = d["ci_95"]

# After:
d = ci.to_dict()
p = d["point"]
lo, hi = d["low"], d["high"]
```

Same rewrite applies to `PairedBootstrapCI.to_dict()`:

```text
# Before: {"delta": 0.1, "ci_95": [0.05, 0.15], "overlaps_zero": False, ...}
# After:  {"delta": 0.1, "low": 0.05, "high": 0.15, "overlaps_zero": False, ...}
```
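If you also need to read dicts that were serialized before v0.48 (for example, archived JSON results), a small compatibility shim can bridge the two layouts. The sketch below is not part of `eval_toolkit`: `normalize_ci_dict` is a hypothetical helper, and it assumes the only schema differences are the key changes described above.

```python
def normalize_ci_dict(d: dict) -> dict:
    """Return a copy of a serialized CI dict in the v0.48+ key layout.

    Hypothetical helper (not part of eval_toolkit). Assumes the only
    changes are point_estimate -> point and ci_95 -> low/high.
    """
    out = dict(d)  # shallow copy; the caller's dict is left untouched
    if "point_estimate" in out:
        out["point"] = out.pop("point_estimate")
    if "ci_95" in out:
        out["low"], out["high"] = out.pop("ci_95")
    return out


old = {"point_estimate": 0.5, "ci_95": [0.4, 0.6], "confidence": 0.90,
       "n_resamples": 1000, "method": "BCa"}
new = normalize_ci_dict(old)
print(new["point"], new["low"], new["high"], new["confidence"])
# prints: 0.5 0.4 0.6 0.9

# Paired dicts keep "delta"; only the interval keys change.
paired = normalize_ci_dict(
    {"delta": 0.1, "ci_95": [0.05, 0.15], "overlaps_zero": False}
)
print(sorted(paired))
# prints: ['delta', 'high', 'low', 'overlaps_zero']
```

Dicts that are already in the v0.48+ layout pass through unchanged, so the shim is safe to apply to mixed inputs.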

### 2. `sweep()` adds `strategy_id` column + rejects duplicates

The `sweep()` DataFrame schema grew by one column (`strategy_id`,
inserted between `text_id` and `variant`):

**Before v0.48:**

```text
columns: text_id, variant, transformed_text[, original_score,
         transformed_score, asr]
```

**v0.48+:**

```text
columns: text_id, strategy_id, variant, transformed_text[, original_score,
         transformed_score, asr]
```

`strategy_id` is a canonical per-row identifier built from the
strategy's configured kwargs (e.g.,
`"delimit/delimiter='<<',end='>>'"`). It exists so downstream analysis
can disambiguate two configured instances of the same dataclass that
share `.name`. `variant` keeps the pre-v0.48 shape for backward-compat
`groupby` queries.

Callers indexing the DataFrame by column position must re-check
offsets. Callers indexing by column name are unaffected.
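
To make the offset shift concrete, the sketch below uses plain tuples as a stand-in for DataFrame rows. That stand-in is an assumption made only to keep the example dependency-free, and the row values are invented for illustration.

```python
# Column layouts taken from the schemas above (optional scorer columns omitted).
OLD_COLUMNS = ["text_id", "variant", "transformed_text"]
NEW_COLUMNS = ["text_id", "strategy_id", "variant", "transformed_text"]

# One illustrative row per layout; the values are made up for this example.
old_row = (0, "delimit", "<<hello>>")
new_row = (0, "delimit/delimiter='<<',end='>>'", "delimit", "<<hello>>")

# Position-based access: index 1 used to be `variant`, now it is `strategy_id`.
print(old_row[1])  # delimit
print(new_row[1])  # delimit/delimiter='<<',end='>>'


def get(row, columns, name):
    """Look a value up by column name so layout changes cannot shift it."""
    return row[columns.index(name)]


print(get(old_row, OLD_COLUMNS, "variant"))  # delimit
print(get(new_row, NEW_COLUMNS, "variant"))  # delimit
```

With a real DataFrame the same idea is `df["variant"]` rather than `df.iloc[:, 1]`.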

`sweep()` now **rejects** two strategies that produce the same
`strategy_id`:

```text
# Illustrative — this CALL deliberately raises at v0.48+ to surface
# the silent-merge anti-pattern that pre-v0.48 hid:
from eval_toolkit import sweep, DelimitVariant

sweep([DelimitVariant(), DelimitVariant()], ["hello"])
# ValueError: sweep(): duplicate strategy_id "delimit/..." at index 1
#             (previously at index 0); each strategy must produce a unique
#             strategy_id. If you want two configurations of the same
#             dataclass in the same sweep, vary their kwargs so the
#             canonical identifier differs.
```

If you want to sweep over multiple configurations of the same
dataclass, vary the kwargs (the canonical pattern — this one executes
cleanly):

```python
from eval_toolkit import sweep, DelimitVariant

texts = ["hello", "world"]
df = sweep(
    [DelimitVariant(delimiter="<<"), DelimitVariant(delimiter="[[")],
    texts,
)
# Rows from both strategies survive; strategy_id distinguishes them.
# df.groupby("strategy_id") is the canonical disambiguation pattern.
print(df["strategy_id"].unique().tolist())
```
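
This guide does not show how the library builds `strategy_id`. As a rough mental model only, the sketch below derives an identifier of the same shape as the example above from a dataclass's fields, then checks a list of strategies for collisions the way a pre-flight check might. `ToyDelimit`, `toy_strategy_id`, and `find_duplicate_ids` are hypothetical names, not toolkit APIs.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ToyDelimit:
    """Hypothetical stand-in for a configured strategy (not DelimitVariant)."""
    delimiter: str = "<<"
    end: str = ">>"
    name: str = "delimit"


def toy_strategy_id(strategy) -> str:
    """Join `name` with the repr of every other field, in declaration order."""
    kwargs = ",".join(
        f"{f.name}={getattr(strategy, f.name)!r}"
        for f in fields(strategy)
        if f.name != "name"
    )
    return f"{strategy.name}/{kwargs}"


def find_duplicate_ids(strategies) -> list:
    """Return each identifier that appears more than once, in first-seen order."""
    seen, dupes = set(), []
    for s in strategies:
        sid = toy_strategy_id(s)
        if sid in seen and sid not in dupes:
            dupes.append(sid)
        seen.add(sid)
    return dupes


print(toy_strategy_id(ToyDelimit()))
# delimit/delimiter='<<',end='>>'
print(find_duplicate_ids([ToyDelimit(), ToyDelimit()]))
# ["delimit/delimiter='<<',end='>>'"]
print(find_duplicate_ids([ToyDelimit(delimiter="<<"), ToyDelimit(delimiter="[[")]))
# []
```

Under this model, two default-configured instances collide because every field matches, while changing any one kwarg produces a distinct identifier.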

### 3. `sweep()` validates scorer output shape

A `Scorer` that returns a wrong-shape array now raises an API-level
`ValueError` at the `sweep()` boundary:

```text
# Illustrative — this CALL deliberately raises at v0.48+:
import numpy as np
from eval_toolkit import sweep, DelimitVariant

class _BadScorer:
    def predict_proba(self, X):
        return np.array([0.5] * (len(X) + 1))  # one too many scores

# v0.48 raises immediately at the sweep boundary:
sweep([DelimitVariant()], ["a", "b"], scorer=_BadScorer(), attack_threshold=0.5)
# ValueError: sweep(): scorer.predict_proba(original-texts batch) returned
#             shape (3,); expected (2,). The Scorer Protocol requires one
#             float P(positive) per input row...
```

**Pre-v0.48**: silent truncation (overlong), `IndexError` (short), or
`TypeError` (matrix-shaped). The first produced no error at all, and the
other two were low-level numpy errors that didn't identify the offending
scorer call.
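
Scorer adapters can enforce the same contract locally, so a failure points at the adapter rather than at `sweep()`. The sketch below is dependency-free and hypothetical: `CheckedScorer` is not a toolkit class, and it assumes only what the error message above states, namely that `predict_proba` returns one float per input row.

```python
from numbers import Real


class CheckedScorer:
    """Hypothetical wrapper that enforces one score per input row."""

    def __init__(self, inner):
        self.inner = inner  # any object exposing predict_proba(X)

    def predict_proba(self, X):
        scores = self.inner.predict_proba(X)
        name = type(self.inner).__name__
        if len(scores) != len(X):
            raise ValueError(
                f"{name}.predict_proba returned {len(scores)} scores "
                f"for {len(X)} inputs"
            )
        # A nested row (matrix-shaped output) is not a single real number.
        if not all(isinstance(s, Real) for s in scores):
            raise ValueError(
                f"{name}.predict_proba must return a flat sequence of floats"
            )
        return scores


class OverlongScorer:
    def predict_proba(self, X):
        return [0.5] * (len(X) + 1)  # one too many scores


class GoodScorer:
    def predict_proba(self, X):
        return [0.5] * len(X)


print(CheckedScorer(GoodScorer()).predict_proba(["a", "b"]))  # [0.5, 0.5]
try:
    CheckedScorer(OverlongScorer()).predict_proba(["a", "b"])
except ValueError as exc:
    print(exc)  # OverlongScorer.predict_proba returned 3 scores for 2 inputs
```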

## What's added at v0.48 (additive — no migration needed)

- **`make pre-push`** target: a local-dev gate that mirrors CI's three
  doc-execution surfaces (Sybil + MyST-NB + `--doctest-modules`). It is
  motivated by the Sub-PR-7 incident postmortem
  (`feedback_sybil_python_blocks`): `pytest tests/` silently overrides
  `testpaths` and drops 159 Sybil items from collection. `make pre-push`
  runs without the positional path argument, so all three surfaces stay
  covered.
- **`nb_execution_raise_on_error = True`** in `docs/source/conf.py` —
  docs CI now fails on notebook execution errors instead of leaving
  them as advisory warnings (Decision R7-A; closes R7-F1).
- **`.doctest-modules` expanded** from 11 → 21 modules. `make test` +
  CI now catch future drift in 10 additional modules' in-source
  docstring examples.
- **ADR 0001** (flat-module layout, finalized) + **ADR 0003**
  (stability contract + Gate 3 methodology, finalized).
- **Standardized `ImportError` messages** across all lazy-extras
  surfaces. Every `ImportError` raise now follows the canonical
  template: `"<feature> requires <pkg>. Install with: pip install
  eval-toolkit[<extra>]"`.
- **Cross-API shape-validation consistency** — `metrics_at_threshold`,
  `paired_bootstrap_op_point_diff`, `bootstrap_metric_from_predictions`,
  the `metrics.py` scalars, and the `fit_*_binary` calibrator family
  all now validate input shape at their API boundaries with
  contextual `ValueError`s (no low-level numpy errors leaking).
- **`paired_bootstrap_op_point_diff` defensive guard** — passing the
  same array for `val_y` + `test_y` raises `ValueError` (Round 5 R5-F6e
  finding; the two-level bootstrap assumes disjoint partitions).
- **Documentation polish**: `SynonymSubstitution` whitelist `Notes`
  section; `Scorecard.to_pandas()` dtype coercion `Notes`;
  `CostSensitiveSelector` calibrated-prior warning; Round 5
  packet-drift fixes across 7 methodology pages.
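
The `ImportError` template from the list above can be reproduced with a small lazy-import helper. This is a sketch of the pattern only: `require`, the `plot_scorecard()` feature label, and the `plots` and `core` extras are hypothetical names, and the toolkit's own internals may be organized differently.

```python
import importlib


def require(feature: str, pkg: str, extra: str):
    """Import `pkg` lazily, raising the templated ImportError if it is missing."""
    try:
        return importlib.import_module(pkg)
    except ImportError as exc:
        raise ImportError(
            f"{feature} requires {pkg}. "
            f"Install with: pip install eval-toolkit[{extra}]"
        ) from exc


# A module name chosen so that it is certainly absent.
try:
    require("plot_scorecard()", "definitely_not_installed_pkg", "plots")
except ImportError as exc:
    message = str(exc)
    print(message)
# plot_scorecard() requires definitely_not_installed_pkg. Install with: pip install eval-toolkit[plots]

json_mod = require("JSON export", "json", "core")  # stdlib, so this succeeds
print(json_mod.dumps({"ok": True}))  # {"ok": true}
```

Chaining with `from exc` keeps the original import failure in the traceback while the user-facing message stays uniform.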

## Migration checklist

Before bumping the pin to `eval-toolkit==0.48.0`:

- [ ] Replace `d["point_estimate"]` → `d["point"]`; replace
  `d["ci_95"]` → `(d["low"], d["high"])` everywhere you consume
  `BootstrapCI.to_dict()` or `PairedBootstrapCI.to_dict()` output.
- [ ] Audit `sweep()` callsites for column-position indexing — the
  DataFrame now has 4 columns before the optional scorer columns
  (was 3). Switch to column-name indexing if you weren't already.
- [ ] Audit `sweep()` callsites for intentional duplicate-instance
  sweeps. If you pass the same configured strategy twice, either
  remove the duplicate or vary the kwargs.
- [ ] If any of your `Scorer` adapters return wrong-shape arrays
  (especially silent overlong), fix them — `sweep()` now refuses
  to silently truncate.
- [ ] Run your test suite against the new pin. Every affected callsite
  that the suite exercises fails loudly at runtime: a `KeyError` for
  the renamed dict keys, or a `ValueError` from the new validation
  checks.
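
The first checklist item can be partly automated by searching source text for the two removed keys. The sketch below scans strings rather than files so that it stays self-contained. `find_legacy_keys` is a hypothetical helper, and a plain-text match like this one will also flag comments and unrelated uses of the same quoted names.

```python
import re

# The two keys removed from to_dict() output at v0.48, matched only when quoted.
LEGACY_KEY = re.compile(r"""["'](point_estimate|ci_95)["']""")


def find_legacy_keys(source: str) -> list:
    """Return (line_number, key) pairs for each quoted legacy key in `source`."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in LEGACY_KEY.finditer(line):
            hits.append((lineno, match.group(1)))
    return hits


sample = '''d = ci.to_dict()
p = d["point_estimate"]
lo, hi = d["ci_95"]
width = d["high"] - d["low"]
'''
print(find_legacy_keys(sample))
# [(2, 'point_estimate'), (3, 'ci_95')]
```

Matching only quoted keys leaves constructor calls such as `BootstrapCI(point_estimate=...)` alone. To cover a source tree, the same function could be applied to the text of each file found by `pathlib.Path.rglob("*.py")`.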

## What's next (v1.0 stability commitment)

After v0.48 ships and has been through at least one consumer cycle, the
Round 8 audit STOP-GATE (Decision Y.2) opens: a final Codex + Gemini
pass against the complete pre-v1.0 packet, followed by `v1.0.0`:

- No new code at v1.0 — content-identical to v0.48 modulo the
  version bump + roadmap edits + ADR finalization confirmation.
- All 4 v1.0 gates closed: Gate 1 (consumer cycle), Gate 2 (Protocol
  stability), Gate 3 (multi-model cross-review), Gate 4 (Croissant
  e2e — already MET at v0.41).

See the v1.0 sprint plan at
`~/.claude/plans/evaluate-all-the-work-twinkly-kite.md` for the full
release sequence + locked decisions A–Z + R6-A through R6-H + R7-A
through R7-C.