v0.7.x → v0.8.0 migration
v0.8.0 is a small BREAKING release focused on closing the v0.7.0/0.7.1
__version__ mismatch and locking down the ECE input contract.
The [parquet] extra is also formalized.
At a glance
| Change | Type |
|---|---|
| __version__ mismatch | Bug fix; tightens version detection |
| ECE input validation | Documented as enforced behavior. The check was already enforced in code, but the v0.3 audit had flagged it as P1; v0.8.0 is the release that locks it in via parametric regression tests |
| [parquet] extra | Additive packaging change |
| 4 new methodology chapters | Additive docs |
| New quantile_stratified_report helper | Additive metric |
1. __version__ mismatch
v0.7.1 shipped with a bug: pyproject.toml declared version 0.7.1, but
src/eval_toolkit/__init__.py:13 still said __version__ = "0.7.0".
Consumer code that branches on eval_toolkit.__version__ would see
the wrong value.
v0.8.0 closes this. No action needed unless you were relying on the wrong value:
import eval_toolkit
# Current releases retain the fixed version contract from v0.8 onward.
major, minor, *_ = eval_toolkit.__version__.split(".")
assert (int(major), int(minor)) >= (0, 8)
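The check above converts the parts to integers before comparing, which matters because plain string comparison misorders versions such as 0.10 and 0.8. As a minimal sketch, the helpers below (the names major_minor and metadata_matches are hypothetical, not part of eval_toolkit) also show one way to cross-check __version__ against the installed distribution metadata using the standard library's importlib.metadata. The sketch assumes plain numeric major.minor.patch strings.

```python
from importlib import metadata
from typing import Optional, Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Parse '0.8.0' into (0, 8). Assumes numeric major and minor parts."""
    major, minor, *_ = version.split(".")
    return int(major), int(minor)


def metadata_matches(module_version: str, dist_name: str) -> Optional[bool]:
    """Compare a module's __version__ with the installed distribution metadata.

    Returns None when the distribution is not installed (e.g. a source checkout).
    """
    try:
        installed = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None
    return installed == module_version


# Integer tuples order correctly; raw strings do not.
assert major_minor("0.10.2") >= (0, 8)
assert not ("0.10.2" >= "0.8.0")
```

With the toolkit installed, metadata_matches(eval_toolkit.__version__, "eval-toolkit") would have returned False on v0.7.1 (0.7.1 in the metadata, 0.7.0 in the module) and should return True from v0.8.0 onward.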
2. ECE input validation
The five ECE functions in eval_toolkit.metrics
(expected_calibration_error, expected_calibration_error_debiased,
expected_calibration_error_l2,
expected_calibration_error_l2_debiased,
expected_calibration_error_equal_mass) raise ValueError when
y_score falls outside [0, 1].
This was already enforced in code (the helper
_validate_calibrated_score was wired in pre-v0.8). v0.8.0 adds
parametric regression tests so the contract can’t silently regress
in future releases. If your code was already passing valid
probabilities, nothing changes. If you were silently passing
uncalibrated logits and getting nonsense ECE numbers, you’ll now see
a clear error.
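To see why out-of-range scores produce meaningless numbers, it helps to recall what ECE measures. The following is a minimal sketch of the textbook equal-width-bin ECE for binary P(y=1) scores. The function name ece_equal_width_sketch is hypothetical and the binning details are assumptions, so it may differ from what eval_toolkit actually computes.

```python
from typing import Sequence


def ece_equal_width_sketch(
    y_true: Sequence[int], y_score: Sequence[float], n_bins: int = 10
) -> float:
    """Weighted mean of |fraction of positives - mean score| over equal-width bins."""
    if any(s < 0.0 or s > 1.0 for s in y_score):
        raise ValueError("y_score must be in [0, 1]")
    # Bucket each (label, score) pair; a score of exactly 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for label, score in zip(y_true, y_score):
        idx = min(int(score * n_bins), n_bins - 1)
        bins[idx].append((label, score))
    n = len(y_score)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        frac_pos = sum(label for label, _ in members) / len(members)
        mean_score = sum(score for _, score in members) / len(members)
        ece += (len(members) / n) * abs(frac_pos - mean_score)
    return ece


print(ece_equal_width_sketch([0, 1, 0, 1], [0.1, 0.8, 0.3, 0.9], n_bins=4))
```

Each bin compares a mean score with an observed frequency, so the comparison is only meaningful when scores live on the same [0, 1] scale as frequencies. A logit such as 3.0 has no bin to land in.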
Decoding the ValueError
ValueError: y_score must be in [0, 1] for calibration metrics; got
range [-2.5, 4.0]. If you have logits, apply softmax/sigmoid first.
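If you would rather catch bad inputs before they reach the metric, a consumer-side pre-check is easy to write. This is a sketch only: check_score_range is a hypothetical helper, not the library's _validate_calibrated_score, and its message merely mimics the format shown above. NaN values are not handled here.

```python
from typing import Iterable, Tuple


def check_score_range(scores: Iterable[float]) -> Tuple[float, float]:
    """Return (min, max) of the scores, raising if any falls outside [0, 1]."""
    values = [float(s) for s in scores]
    if not values:
        raise ValueError("y_score is empty")
    lo, hi = min(values), max(values)
    if lo < 0.0 or hi > 1.0:
        raise ValueError(
            f"y_score must be in [0, 1] for calibration metrics; got range [{lo}, {hi}]"
        )
    return lo, hi


# Valid probabilities pass through and report their observed range.
print(check_score_range([0.1, 0.9, 0.5]))
```

Because the helper only iterates over its input, it accepts a plain list as well as a 1-D NumPy array.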
Migration
import numpy as np
from eval_toolkit import expected_calibration_error
# logits, not probabilities → fails fast
logits = np.array([-2.0, 1.5, -0.5, 3.0])
y = np.array([0, 1, 0, 1])
# Wrong (raises):
# expected_calibration_error(y, logits, n_bins=4)
# Right — sigmoid first:
probs = 1 / (1 + np.exp(-logits))
ece = expected_calibration_error(y, probs, n_bins=4)
print(f"ECE = {ece:.3f}")
For binary classification with two-column logits (shape (n, 2)),
apply softmax across the columns and take column 1:
import numpy as np
from eval_toolkit import expected_calibration_error
logits_2d = np.array([[1.5, -0.3], [-0.8, 2.1], [0.0, 0.5], [-1.0, -0.2]])
y = np.array([0, 1, 1, 0])
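# Numerically stable softmax across columns (subtract each row's max first); take P(y=1):
shifted = logits_2d - logits_2d.max(axis=1, keepdims=True)
probs_2d = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)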
p_pos = probs_2d[:, 1]
ece = expected_calibration_error(y, p_pos, n_bins=4)
print(f"ECE = {ece:.3f}")
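For the two-column case, softmax followed by taking column 1 is algebraically the same as applying a sigmoid to the logit difference z1 - z0, since exp(z1) / (exp(z0) + exp(z1)) = 1 / (1 + exp(-(z1 - z0))). A small standard-library sketch checks this on the logits above and uses overflow-safe forms of both functions. The helper names are illustrative, not toolkit APIs.

```python
import math


def stable_sigmoid(x: float) -> float:
    """Sigmoid that never exponentiates a large positive number."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def softmax_positive(z0: float, z1: float) -> float:
    """P(y=1) from two-class logits, shifting by the max for stability."""
    m = max(z0, z1)
    e0, e1 = math.exp(z0 - m), math.exp(z1 - m)
    return e1 / (e0 + e1)


logits_2d = [(1.5, -0.3), (-0.8, 2.1), (0.0, 0.5), (-1.0, -0.2)]
for z0, z1 in logits_2d:
    # Both routes give the same probability for the positive class.
    assert math.isclose(softmax_positive(z0, z1), stable_sigmoid(z1 - z0))
```

So a model that emits two logits per example and one that emits a single logit can feed the same ECE call, as long as the value passed is P(y=1).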
For PyTorch logits, see also methodology/calibration.md §“PyTorch & transformer specifics”.
3. [parquet] extra
v0.7.x had pyarrow in [dev] only — consumers using
ParquetGlobLoader had to install the entire dev dependency stack.
v0.8.0 splits it into a focused [parquet] extra:
# v0.7.x (worked but pulled in pytest, ruff, black, mypy, ...):
pip install "eval-toolkit[dev]"
# v0.8.0 (focused):
pip install "eval-toolkit[parquet]"
[dev] continues to depend on [parquet] so CI still exercises
ParquetGlobLoader.
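If your code only needs Parquet loading on some paths, you may want to fail with an actionable message when the extra is missing. A minimal standard-library sketch follows. The helper require_extra is hypothetical, and the assumption that the [parquet] extra is satisfied by an importable pyarrow module is taken from the description above.

```python
import importlib.util


def require_extra(module_name: str, extra: str, dist_name: str = "eval-toolkit") -> None:
    """Raise ImportError with an install hint if module_name cannot be found."""
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f'{module_name} is not installed; try: pip install "{dist_name}[{extra}]"'
        )


try:
    require_extra("pyarrow", "parquet")
    print("pyarrow found; Parquet loading should be available")
except ImportError as exc:
    print(exc)
```

Calling such a guard at the top of the code path that builds a ParquetGlobLoader turns a late import failure into an early, self-explanatory one.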
4. New helper: quantile_stratified_report
Additive: wraps the existing quantile_stratified_pr_auc in the
four-field SDD reporting shape ({full, trimmed, gap, gap_flag}).
See methodology/length_stratification.md for the motivation.
5. New methodology chapters
Four new chapters:
- bootstrap.md: BCa derivation, paired CIs, MDE, two-level bootstrap, K-fold CV-CI.
- text_dedup.md: when to use each SimilarityStrategy; threshold tuning; LSH false-negative rates.
- versioning.md: the Versioned Protocol; how to expose version on consumer Scorers; lm-eval pattern.
- length_stratification.md: quantile_stratified_report, McClish 1989 partial-AUC framing, gap_flag convention.
6. New docs/roadmap.md
Forward-looking tracker; cross-links consumer gap docs.
See also
- docs/migration/v0.7.md: v0.6 → v0.7 (the larger BREAKING release; if you’re upgrading from v0.6.x, read both).
- CHANGELOG.md: full release notes.