v0.7.x → v0.8.0 migration

v0.8.0 is a small BREAKING release focused on closing the v0.7.0/0.7.1 __version__ mismatch and locking down the ECE input contract. The [parquet] extra is also formalized.

At a glance

| Change | Type |
| --- | --- |
| __version__ was "0.7.0" in the v0.7.1 wheel; v0.8.0 fixes the mismatch | BUG fix; tightens version detection |
| expected_calibration_error* (5 functions) raise ValueError when y_score is outside [0, 1] | Documented as enforced behavior. It was already enforced in code, but the v0.3 audit had flagged it as P1; v0.8.0 locks it in via parametric regression tests |
| pyarrow moved from [dev] only to a new [parquet] extra | Additive packaging change |
| 4 new methodology chapters (bootstrap, text_dedup, versioning, length_stratification) + roadmap | Additive docs |
| New quantile_stratified_report helper | Additive metric |

1. __version__ mismatch

v0.7.1 shipped with a bug: pyproject.toml declared 0.7.1, but src/eval_toolkit/__init__.py:13 still set __version__ = "0.7.0". Consumer code that branches on eval_toolkit.__version__ saw the wrong value.

v0.8.0 closes this. No action needed unless you were relying on the wrong value:

import eval_toolkit
# From v0.8.0 onward, __version__ matches the packaged version.
major, minor, *_ = eval_toolkit.__version__.split(".")
assert (int(major), int(minor)) >= (0, 8)
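
To detect this kind of mismatch directly instead of trusting a single source, one option is to compare the module attribute against the installed distribution metadata. The sketch below uses only the standard library. The helper names (major_minor, installed_version, versions_agree) are illustrative and not part of eval_toolkit, and the parser assumes plain numeric components such as "0.8.0" (pre-release suffixes would need a real version parser).

```python
from __future__ import annotations

from importlib import metadata


def major_minor(version: str) -> tuple[int, int]:
    """Parse 'X.Y.Z' into (X, Y). Assumes plain numeric components."""
    major, minor, *_ = version.split(".")
    return int(major), int(minor)


def installed_version(dist_name: str) -> str | None:
    """Return the installed distribution's version, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def versions_agree(module_version: str, dist_version: str | None) -> bool:
    """True when the module attribute matches the package metadata."""
    return dist_version is not None and module_version == dist_version


# The v0.7.1 wheel situation: metadata said 0.7.1, the module said 0.7.0.
print(versions_agree("0.7.0", "0.7.1"))
print(major_minor("0.8.0") >= (0, 8))
```

In consumer code the call would look like versions_agree(eval_toolkit.__version__, installed_version("eval-toolkit")), assuming the distribution is installed under the name used in the pip commands in this guide.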

2. ECE input validation

The five ECE functions in eval_toolkit.metrics (expected_calibration_error, expected_calibration_error_debiased, expected_calibration_error_l2, expected_calibration_error_l2_debiased, expected_calibration_error_equal_mass) raise ValueError when y_score falls outside [0, 1].

This was already enforced in code (the helper _validate_calibrated_score was wired in pre-v0.8). v0.8.0 adds parametric regression tests so the contract can’t silently regress in future releases. If your code was already passing valid probabilities, nothing changes. If you were silently passing uncalibrated logits and getting nonsense ECE numbers, you’ll now see a clear error.

Decoding the ValueError

ValueError: y_score must be in [0, 1] for calibration metrics; got
range [-2.5, 4.0]. If you have logits, apply softmax/sigmoid first.
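
If you would rather catch bad inputs at your own pipeline boundary, before they reach the metric call, a consumer-side pre-check is easy to write. The sketch below is hypothetical: check_probabilities is not the library's _validate_calibrated_score, whose implementation is not shown here. It only mirrors the documented [0, 1] range contract and does not handle NaN values.

```python
import numpy as np


def check_probabilities(y_score) -> np.ndarray:
    """Hypothetical pre-check mirroring the documented [0, 1] contract."""
    scores = np.asarray(y_score, dtype=float)
    lo, hi = scores.min(), scores.max()
    if lo < 0.0 or hi > 1.0:
        raise ValueError(
            f"y_score must be in [0, 1]; got range [{lo}, {hi}]. "
            "Apply sigmoid/softmax to logits first."
        )
    return scores


# Logits are rejected; valid probabilities pass through unchanged.
try:
    check_probabilities([-2.5, 0.3, 4.0])
except ValueError as exc:
    print(exc)

print(check_probabilities([0.1, 0.9]))
```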

Migration

import numpy as np
from eval_toolkit import expected_calibration_error

# logits, not probabilities → fails fast
logits = np.array([-2.0, 1.5, -0.5, 3.0])
y = np.array([0, 1, 0, 1])

# Wrong (raises):
# expected_calibration_error(y, logits, n_bins=4)

# Right — sigmoid first:
probs = 1 / (1 + np.exp(-logits))
ece = expected_calibration_error(y, probs, n_bins=4)
print(f"ECE = {ece:.3f}")

For binary classification with two-column logits (shape (n, 2)), apply softmax across the columns and take column 1, which is P(y=1):

import numpy as np
from eval_toolkit import expected_calibration_error

logits_2d = np.array([[1.5, -0.3], [-0.8, 2.1], [0.0, 0.5], [-1.0, -0.2]])
y = np.array([0, 1, 1, 0])

# Softmax across columns; take P(y=1):
probs_2d = np.exp(logits_2d) / np.exp(logits_2d).sum(axis=1, keepdims=True)
p_pos = probs_2d[:, 1]
ece = expected_calibration_error(y, p_pos, n_bins=4)
print(f"ECE = {ece:.3f}")
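
The direct formulas above work for moderate logits, but np.exp overflows for large magnitudes: NumPy emits a RuntimeWarning, and the softmax can return nan. A minimal sketch of the usual numerically stable variants follows. The names stable_sigmoid and stable_softmax are illustrative, not eval_toolkit functions. If SciPy is available, scipy.special.expit and scipy.special.softmax cover the same ground.

```python
import numpy as np


def stable_sigmoid(x) -> np.ndarray:
    """Sigmoid that never exponentiates a positive number."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    # For x >= 0, exp(-x) <= 1, so no overflow.
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    # For x < 0, exp(x) < 1; rewrite as exp(x) / (1 + exp(x)).
    ex = np.exp(x[~pos])
    out[~pos] = ex / (1.0 + ex)
    return out


def stable_softmax(logits, axis: int = -1) -> np.ndarray:
    """Softmax with the row maximum subtracted before exponentiating."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


# Extreme logits stay finite and inside [0, 1].
print(stable_sigmoid(np.array([-1000.0, -2.0, 0.0, 1.5, 1000.0])))
print(stable_softmax(np.array([[1000.0, 1001.0], [1.5, -0.3]]))[:, 1])
```

For the two-column case, the softmax probability of column 1 equals the sigmoid of the logit difference (column 1 minus column 0), so either helper gives the same P(y=1).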

For PyTorch logits, see also methodology/calibration.md §“PyTorch & transformer specifics”.

3. [parquet] extra

v0.7.x had pyarrow in [dev] only — consumers using ParquetGlobLoader had to install the entire dev dependency stack. v0.8.0 splits it into a focused [parquet] extra:

# v0.7.x (worked but pulled in pytest, ruff, black, mypy, ...):
pip install "eval-toolkit[dev]"

# v0.8.0 (focused):
pip install "eval-toolkit[parquet]"

[dev] continues to depend on [parquet] so CI still exercises ParquetGlobLoader.
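
Because pyarrow is now optional, consumer code that wraps Parquet loading may want to fail early with an actionable message when the extra is missing. The sketch below is one way to do that with the standard library. require_module is a hypothetical helper, and this guide does not state whether ParquetGlobLoader performs a similar check itself.

```python
import importlib.util


def require_module(module_name: str, extra: str) -> None:
    """Raise ImportError naming the extra that provides a missing module."""
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"{module_name} is not installed; "
            f'install it with: pip install "eval-toolkit[{extra}]"'
        )


# Call this before constructing a Parquet-backed loader.
try:
    require_module("pyarrow", "parquet")
    print("pyarrow available")
except ImportError as exc:
    print(exc)
```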

4. New helper: quantile_stratified_report

Additive: wraps the existing quantile_stratified_pr_auc into the four-field SDD reporting shape ({full, trimmed, gap, gap_flag}). See methodology/length_stratification.md for the motivation.

5. New methodology chapters

Four new chapters:

  • bootstrap.md: BCa derivation, paired CIs, MDE, two-level bootstrap, K-fold CV-CI.

  • text_dedup.md: when to use each SimilarityStrategy; threshold tuning; LSH false-negative rates.

  • versioning.md: the Versioned Protocol; how to expose version on consumer Scorers; lm-eval pattern.

  • length_stratification.md: quantile_stratified_report, McClish 1989 partial-AUC framing, gap_flag convention.

6. New docs/roadmap.md

Forward-looking tracker; cross-links consumer gap docs.

See also