# v0.7.x → v0.8.0 migration

v0.8.0 is **a small BREAKING release** focused on closing the v0.7.0/0.7.1 `__version__` mismatch and locking down the ECE input contract. The `[parquet]` extra is also formalized.

## At a glance

| Change | Type |
|---|---|
| `__version__` was `"0.7.0"` in the v0.7.1 wheel; v0.8.0 fixes the mismatch | Bug fix; tightens version detection |
| `expected_calibration_error*` (5 functions) raise `ValueError` on `y_score ∉ [0, 1]` | Documented as enforced behavior; was already enforced in code but the v0.3 audit had flagged it as P1; v0.8.0 is the release that locks it in via parametric regression tests |
| `pyarrow` moved from `[dev]` only to a new `[parquet]` extra | Additive packaging change |
| 4 new methodology chapters (`bootstrap`, `text_dedup`, `versioning`, `length_stratification`) + roadmap | Additive docs |
| New `quantile_stratified_report` helper | Additive metric |

## 1. `__version__` mismatch

**v0.7.1 had a bug**: `pyproject.toml` was `0.7.1` but `src/eval_toolkit/__init__.py:13` still said `__version__ = "0.7.0"`. Consumer code that branches on `eval_toolkit.__version__` would see the wrong value.

v0.8.0 closes this. No action needed unless you were relying on the wrong value:

```python
import eval_toolkit

# Current releases retain the fixed version contract from v0.8 onward.
major, minor, *_ = eval_toolkit.__version__.split(".")
assert (int(major), int(minor)) >= (0, 8)
```

## 2. ECE input validation

The five ECE functions in `eval_toolkit.metrics` (`expected_calibration_error`, `expected_calibration_error_debiased`, `expected_calibration_error_l2`, `expected_calibration_error_l2_debiased`, `expected_calibration_error_equal_mass`) raise `ValueError` when `y_score` falls outside `[0, 1]`.

This was *already* enforced in code (the helper `_validate_calibrated_score` was wired in pre-v0.8). v0.8.0 adds parametric regression tests so the contract can't silently regress in future releases.
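
To check your own inputs against this contract without calling a metric, a pre-flight helper along the following lines may be useful. This is a minimal sketch: `score_range_problem` is a hypothetical name that is not part of `eval_toolkit`, and it only mirrors the documented `[0, 1]` rule; it does not reproduce the internal `_validate_calibrated_score` helper or its exact error message. The NaN and empty-input checks are extra assumptions, not documented library behavior.

```python
import math
from typing import Iterable, Optional


def score_range_problem(y_score: Iterable[float]) -> Optional[str]:
    """Describe why y_score is not a valid probability vector, or return None.

    Hypothetical consumer-side check; not an eval_toolkit API.
    """
    scores = [float(s) for s in y_score]  # works for lists and 1-D arrays
    if not scores:
        return "y_score is empty"
    if any(math.isnan(s) for s in scores):
        return "y_score contains NaN"
    lo, hi = min(scores), max(scores)
    if lo < 0.0 or hi > 1.0:
        return f"y_score range [{lo}, {hi}] is outside [0, 1]; these may be logits"
    return None


# Probabilities pass; raw logits are reported with their observed range.
print(score_range_problem([0.1, 0.9, 0.5]))
print(score_range_problem([-2.0, 1.5, -0.5, 3.0]))
```

Returning a message instead of raising lets you log every offending call site in one pass before deciding where to add a sigmoid or softmax.
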
If your code was already passing valid probabilities, nothing changes. If you were silently passing uncalibrated logits and getting nonsense ECE numbers, you'll now see a clear error.

### Decoding the `ValueError`

```text
ValueError: y_score must be in [0, 1] for calibration metrics; got range [-2.5, 4.0]. If you have logits, apply softmax/sigmoid first.
```

### Migration

```python
import numpy as np
from eval_toolkit import expected_calibration_error

# logits, not probabilities → fails fast
logits = np.array([-2.0, 1.5, -0.5, 3.0])
y = np.array([0, 1, 0, 1])

# Wrong (raises):
# expected_calibration_error(y, logits, n_bins=4)

# Right: apply sigmoid first.
probs = 1 / (1 + np.exp(-logits))
ece = expected_calibration_error(y, probs, n_bins=4)
print(f"ECE = {ece:.3f}")
```

For binary classification with two-column logits (shape `(n, 2)`), apply softmax and take column 1:

```python
import numpy as np
from eval_toolkit import expected_calibration_error

logits_2d = np.array([[1.5, -0.3], [-0.8, 2.1], [0.0, 0.5], [-1.0, -0.2]])
y = np.array([0, 1, 1, 0])

# Softmax across columns; take P(y=1):
probs_2d = np.exp(logits_2d) / np.exp(logits_2d).sum(axis=1, keepdims=True)
p_pos = probs_2d[:, 1]
ece = expected_calibration_error(y, p_pos, n_bins=4)
print(f"ECE = {ece:.3f}")
```

For PyTorch logits, see also [`methodology/calibration.md` §"PyTorch & transformer specifics"](../methodology/calibration.md#pytorch).

## 3. `[parquet]` extra

v0.7.x had `pyarrow` in `[dev]` only, so consumers using `ParquetGlobLoader` had to install the entire dev dependency stack. v0.8.0 splits it into a focused `[parquet]` extra:

```bash
# v0.7.x (worked but pulled in pytest, ruff, black, mypy, ...):
pip install "eval-toolkit[dev]"

# v0.8.0 (focused):
pip install "eval-toolkit[parquet]"
```

`[dev]` continues to depend on `[parquet]` so CI still exercises `ParquetGlobLoader`.

## 4. New helper: `quantile_stratified_report`

Additive: wraps the existing `quantile_stratified_pr_auc` into the four-field SDD reporting shape (`{full, trimmed, gap, gap_flag}`). See [`methodology/length_stratification.md`](../methodology/length_stratification.md) for the methodology motivation.

## 5. New methodology chapters

Four new chapters:

- [`bootstrap.md`](../methodology/bootstrap.md): BCa derivation, paired CIs, MDE, two-level bootstrap, K-fold CV-CI.
- [`text_dedup.md`](../methodology/text_dedup.md): when to use each `SimilarityStrategy`; threshold tuning; LSH false-negative rates.
- [`versioning.md`](../methodology/versioning.md): the `Versioned` Protocol; how to expose `version` on consumer Scorers; lm-eval pattern.
- [`length_stratification.md`](../methodology/length_stratification.md): `quantile_stratified_report`, McClish 1989 partial-AUC framing, `gap_flag` convention.

## 6. New `docs/roadmap.md`

Forward-looking tracker; cross-links consumer gap docs.

## See also

- [`docs/migration/v0.7.md`](v0.7.md): v0.6 → v0.7 (the larger BREAKING release; if you're upgrading from v0.6.x, read both).
- [`CHANGELOG.md`](https://github.com/brandon-behring/eval-toolkit/blob/main/CHANGELOG.md): full release notes.