# v0.7.x → v0.8.0 migration

v0.8.0 is **a small BREAKING release**. It fixes the `__version__`
mismatch shipped in the v0.7.1 wheel, locks down the ECE input contract,
and formalizes the `[parquet]` extra.

## At a glance

| Change | Type |
|---|---|
| `__version__` was `"0.7.0"` in the v0.7.1 wheel; v0.8.0 fixes the mismatch | BUG fix; tightens version detection |
| `expected_calibration_error*` (5 functions) raise `ValueError` when `y_score` ∉ `[0, 1]` | Documented as enforced behavior. Already enforced in code, but the v0.3 audit had flagged it as P1; v0.8.0 locks it in with parametric regression tests |
| `pyarrow` moved from `[dev]` only to a new `[parquet]` extra | Additive packaging change |
| 4 new methodology chapters (`bootstrap`, `text_dedup`, `versioning`, `length_stratification`) + roadmap | Additive docs |
| New `quantile_stratified_report` helper | Additive metric |

## 1. `__version__` mismatch

**v0.7.1 had a bug**: `pyproject.toml` was `0.7.1` but
`src/eval_toolkit/__init__.py:13` still said `__version__ = "0.7.0"`.
Consumer code that branches on `eval_toolkit.__version__` would see
the wrong value.

v0.8.0 fixes this: `__version__` now matches the packaged version. No
action is needed unless your code relied on the wrong value. To require
a release that includes the fix:

```python
import eval_toolkit
# From v0.8.0 onward, __version__ matches the packaged version.
major, minor, *_ = eval_toolkit.__version__.split(".")
assert (int(major), int(minor)) >= (0, 8)
```
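If a check also has to behave correctly on the v0.7.1 wheel, one option is to read the installed distribution metadata instead of `__version__`, since that metadata is generated from `pyproject.toml`. The sketch below assumes the distribution name `eval-toolkit` (as used in the `pip install` commands in this guide); `parse_release` and `installed_release` are hypothetical helper names, not part of the library.

```python
import re
from importlib import metadata
from typing import Optional


def parse_release(version: str) -> tuple[int, ...]:
    """Return the leading numeric components, e.g. '0.8.0rc1' -> (0, 8, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def installed_release(dist_name: str = "eval-toolkit") -> Optional[tuple[int, ...]]:
    """Look up the installed distribution's version; None if it is not installed."""
    try:
        return parse_release(metadata.version(dist_name))
    except metadata.PackageNotFoundError:
        return None


release = installed_release()
if release is not None and release < (0, 8):
    print("eval-toolkit predates v0.8.0; __version__ may not match the wheel")
```

Comparing tuples of integers avoids the string-ordering trap where `"0.10"` sorts before `"0.8"`.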

## 2. ECE input validation

The five ECE functions in `eval_toolkit.metrics`
(`expected_calibration_error`, `expected_calibration_error_debiased`,
`expected_calibration_error_l2`,
`expected_calibration_error_l2_debiased`,
`expected_calibration_error_equal_mass`) raise `ValueError` when
`y_score` falls outside `[0, 1]`.

This was *already* enforced in code: the helper
`_validate_calibrated_score` was wired in before v0.8. v0.8.0 adds
parametric regression tests so the contract can't silently regress
in future releases. If your code already passes valid probabilities,
nothing changes. If it passes raw logits, the call fails with a clear
error instead of returning a meaningless ECE value.

### Decoding the `ValueError`

```text
ValueError: y_score must be in [0, 1] for calibration metrics; got
range [-2.5, 4.0]. If you have logits, apply softmax/sigmoid first.
```
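To catch the problem before it reaches the metric, a pre-flight check in consumer code can mirror the contract. The following is a minimal sketch and not the library's `_validate_calibrated_score`; the name `check_unit_interval` is hypothetical, the message is modeled on the error shown above, and treating NaN as invalid is an assumption of this sketch.

```python
import numpy as np


def check_unit_interval(y_score) -> np.ndarray:
    """Raise ValueError unless every score lies in [0, 1] (NaN also fails)."""
    scores = np.asarray(y_score, dtype=float)
    # NaN compares False on both sides, so it is rejected as well.
    in_range = (scores >= 0.0) & (scores <= 1.0)
    if not np.all(in_range):
        raise ValueError(
            "y_score must be in [0, 1] for calibration metrics; got range "
            f"[{np.min(scores)}, {np.max(scores)}]. "
            "If you have logits, apply softmax/sigmoid first."
        )
    return scores


try:
    check_unit_interval([-2.5, 0.3, 4.0])  # logits, so this raises
except ValueError as exc:
    print(exc)
```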

### Migration

```python
import numpy as np
from eval_toolkit import expected_calibration_error

# logits, not probabilities → fails fast
logits = np.array([-2.0, 1.5, -0.5, 3.0])
y = np.array([0, 1, 0, 1])

# Wrong (raises):
# expected_calibration_error(y, logits, n_bins=4)

# Right — sigmoid first:
probs = 1 / (1 + np.exp(-logits))
ece = expected_calibration_error(y, probs, n_bins=4)
print(f"ECE = {ece:.3f}")
```
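For logits with large magnitude, `np.exp(-logits)` can overflow and emit a `RuntimeWarning`, even though the result still rounds to 0. A piecewise form avoids the overflow. This is a sketch using only NumPy; `stable_sigmoid` is a hypothetical helper name, not a library function.

```python
import numpy as np


def stable_sigmoid(logits) -> np.ndarray:
    """Sigmoid for a 1-D or N-D array of logits that never exponentiates a large positive number."""
    x = np.asarray(logits, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    # For x >= 0, exp(-x) <= 1, so the usual formula is safe.
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    # For x < 0, rewrite as exp(x) / (1 + exp(x)); exp(x) < 1 here.
    exp_x = np.exp(x[~pos])
    out[~pos] = exp_x / (1.0 + exp_x)
    return out


probs = stable_sigmoid(np.array([-1000.0, -2.0, 0.0, 3.0, 1000.0]))
assert np.all((probs >= 0.0) & (probs <= 1.0))
```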

For binary classification with two-column logits (shape `(n, 2)`),
apply softmax across the columns and take column 1:

```python
import numpy as np
from eval_toolkit import expected_calibration_error

logits_2d = np.array([[1.5, -0.3], [-0.8, 2.1], [0.0, 0.5], [-1.0, -0.2]])
y = np.array([0, 1, 1, 0])

# Softmax across columns; take P(y=1):
probs_2d = np.exp(logits_2d) / np.exp(logits_2d).sum(axis=1, keepdims=True)
p_pos = probs_2d[:, 1]
ece = expected_calibration_error(y, p_pos, n_bins=4)
print(f"ECE = {ece:.3f}")
```
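The direct softmax formula returns `nan` when a logit is large enough for `np.exp` to overflow (`inf / inf`). Subtracting each row's maximum before exponentiating gives the same probabilities without the overflow. The sketch below uses a hypothetical helper name, `softmax_rows`, and also checks the two-class identity that softmax column 1 equals the sigmoid of the logit difference.

```python
import numpy as np


def softmax_rows(logits_2d) -> np.ndarray:
    """Row-wise softmax with the per-row maximum subtracted for stability."""
    z = np.asarray(logits_2d, dtype=float)
    z = z - z.max(axis=1, keepdims=True)  # largest entry in each row becomes 0
    exp_z = np.exp(z)
    return exp_z / exp_z.sum(axis=1, keepdims=True)


# The last row would produce nan in column 1 with the direct formula.
logits_2d = np.array([[1.5, -0.3], [-0.8, 2.1], [0.0, 1000.0]])
p_pos = softmax_rows(logits_2d)[:, 1]

# Two-class identity: softmax column 1 equals sigmoid(logit_1 - logit_0).
margin = logits_2d[:2, 1] - logits_2d[:2, 0]
assert np.allclose(p_pos[:2], 1.0 / (1.0 + np.exp(-margin)))
```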

For PyTorch logits, see also
[`methodology/calibration.md` §"PyTorch & transformer specifics"](../methodology/calibration.md#pytorch).

## 3. `[parquet]` extra

v0.7.x had `pyarrow` in `[dev]` only — consumers using
`ParquetGlobLoader` had to install the entire dev dependency stack.
v0.8.0 splits it into a focused `[parquet]` extra:

```bash
# v0.7.x (worked but pulled in pytest, ruff, black, mypy, ...):
pip install "eval-toolkit[dev]"

# v0.8.0 (focused):
pip install "eval-toolkit[parquet]"
```

`[dev]` now depends on `[parquet]`, so the dev install still includes
`pyarrow` and CI still exercises `ParquetGlobLoader`.
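Consumer code that treats Parquet support as optional can check for `pyarrow` up front and point users at the extra. This is a minimal sketch using the standard library; `parquet_available` is a hypothetical helper name, and how `ParquetGlobLoader` itself reports a missing `pyarrow` is not covered here.

```python
import importlib.util


def parquet_available() -> bool:
    """True if pyarrow can be located in the current environment (no import is performed)."""
    return importlib.util.find_spec("pyarrow") is not None


if not parquet_available():
    print('pyarrow not found; install it with: pip install "eval-toolkit[parquet]"')
```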

## 4. New helper: `quantile_stratified_report`

Additive. It wraps the existing `quantile_stratified_pr_auc` and reports
the result in the four-field SDD shape (`{full, trimmed, gap, gap_flag}`).
See [`methodology/length_stratification.md`](../methodology/length_stratification.md)
for the motivation.
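To make the four-field shape concrete, here is an illustrative sketch of a record with those keys. Everything beyond the key names is an assumption for illustration: the field types, the definition of `gap` as `full - trimmed`, the `gap_threshold` default, and the names `StratifiedReport` and `make_report` are all hypothetical. The methodology chapter is the authority on the actual convention.

```python
from typing import TypedDict


class StratifiedReport(TypedDict):
    """Hypothetical typing of the {full, trimmed, gap, gap_flag} shape."""
    full: float      # metric on the full sample
    trimmed: float   # metric on the trimmed sample
    gap: float       # ASSUMPTION: full - trimmed
    gap_flag: bool   # ASSUMPTION: |gap| exceeds a chosen threshold


def make_report(full: float, trimmed: float, gap_threshold: float = 0.05) -> StratifiedReport:
    """Assemble the four fields; the threshold value is illustrative only."""
    gap = full - trimmed
    return {
        "full": full,
        "trimmed": trimmed,
        "gap": gap,
        "gap_flag": abs(gap) > gap_threshold,
    }


report = make_report(full=0.90, trimmed=0.80)
print(report["gap_flag"])
```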

## 5. New methodology chapters

Four new chapters:

- [`bootstrap.md`](../methodology/bootstrap.md) — BCa derivation,
  paired CIs, MDE, two-level bootstrap, K-fold CV-CI.
- [`text_dedup.md`](../methodology/text_dedup.md) — when to use each
  `SimilarityStrategy`; threshold tuning; LSH false-negative rates.
- [`versioning.md`](../methodology/versioning.md) — the `Versioned`
  Protocol; how to expose `version` on consumer Scorers; lm-eval
  pattern.
- [`length_stratification.md`](../methodology/length_stratification.md)
  — `quantile_stratified_report`, McClish 1989 partial-AUC framing,
  `gap_flag` convention.

## 6. New `docs/roadmap.md`

Forward-looking tracker; cross-links consumer gap docs.

## See also

- [`docs/migration/v0.7.md`](v0.7.md) — v0.6 → v0.7 (the larger
  BREAKING release; if you're upgrading from v0.6.x, read both).
- [`CHANGELOG.md`](https://github.com/brandon-behring/eval-toolkit/blob/main/CHANGELOG.md) — full release notes.