Calibration#

Background (skip if you’ve internalized this). A classifier is calibrated if its predicted probabilities match observed frequencies: among rows where the model says “70% positive”, roughly 70% are positive. Modern neural networks are routinely miscalibrated, with confidence systematically diverging from accuracy (Guo et al., 2017). Miscalibrated probabilities aren’t just an aesthetic problem: they break cost-sensitive thresholding (thresholds.md), Bayes-optimal decision rules, and any downstream system that treats the score as a probability (selective prediction, ensembling, abstain-cost analysis).

This chapter covers how to measure calibration (ECE variants, Brier decomposition, reliability diagrams), how to fix it (temperature, isotonic, Platt), and when not to.

Setup#

import numpy as np
from eval_toolkit import (
    expected_calibration_error,
    expected_calibration_error_l2,
    expected_calibration_error_debiased,
    expected_calibration_error_l2_debiased,
    expected_calibration_error_equal_mass,
    brier_score, brier_decomposition,
    reliability_curve,
)

A 500-row miscalibrated fixture (overconfident scores: shifted away from 0.5):

rng = np.random.default_rng(42)
y = rng.binomial(1, 0.4, size=500).astype(int)
# Model output: correct on the rank, but overconfident.
linear = 0.7 * y + 0.3 * rng.normal(0, 0.5, size=500)
s_overconfident = np.clip(0.5 + np.tanh(linear * 2.5) * 0.45, 0, 1)

Reliability diagram#

The canonical visual: bin the predictions, then plot mean predicted probability against observed positive rate per bin. Diagonal = perfect calibration.

curve = reliability_curve(y, s_overconfident, n_bins=10, strategy="quantile")
print(f"n_bins={curve['n_bins']}  ece_equal_mass={curve['ece_equal_mass']:.3f}")
# `prob_true` / `prob_pred` arrays plot as the reliability diagram;
# eval_toolkit.plotting.plot_reliability_diagram(...) renders it.

strategy="quantile" (equal-mass binning) is preferred over "uniform" (equal-width) under class imbalance — equal-width concentrates most mass in 1–2 bins and the calibration signal collapses.
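To make the binning difference concrete, here is a minimal NumPy sketch that counts rows per bin under both strategies. The helper `bin_counts` and the Beta-distributed toy scores are illustrative assumptions; they are not part of eval_toolkit, and the toolkit's own edge handling may differ.

```python
import numpy as np

def bin_counts(scores, n_bins=10, strategy="quantile"):
    """Hypothetical helper: return (edges, rows per bin) for scores in [0, 1]."""
    scores = np.asarray(scores, dtype=float)
    if strategy == "uniform":
        edges = np.linspace(0.0, 1.0, n_bins + 1)           # equal-width
    elif strategy == "quantile":
        edges = np.quantile(scores, np.linspace(0.0, 1.0, n_bins + 1))  # equal-mass
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    # Use interior edges only, so the smallest and largest scores
    # land in the first and last bin respectively.
    idx = np.searchsorted(edges[1:-1], scores, side="right")
    return edges, np.bincount(idx, minlength=n_bins)

rng = np.random.default_rng(0)
# Toy scores from an imbalanced problem: most of the mass sits near 0.
toy_scores = rng.beta(0.5, 8.0, size=2000)

_, uniform_counts = bin_counts(toy_scores, strategy="uniform")
_, quantile_counts = bin_counts(toy_scores, strategy="quantile")
print("uniform :", uniform_counts)
print("quantile:", quantile_counts)
```

With equal-width bins, most rows pile into the first bin and the upper bins are nearly empty, so their observed rates are very noisy. Quantile bins hold roughly the same number of rows each.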

ECE variants#

Expected Calibration Error: a single-number summary of the reliability diagram. Four variants ship with eval-toolkit, differing on (a) L1 vs L2 norm and (b) plug-in vs debiased.

Variant	Function	Norm	Debiased
L1 plug-in	`expected_calibration_error`	L1	no
L1 debiased	`expected_calibration_error_debiased`	L1	yes
L2 plug-in	`expected_calibration_error_l2`	L2	no
L2 debiased	`expected_calibration_error_l2_debiased`	L2	yes

e1   = expected_calibration_error(y, s_overconfident, n_bins=10)
e1_d = expected_calibration_error_debiased(y, s_overconfident, n_bins=10)
e2   = expected_calibration_error_l2(y, s_overconfident, n_bins=10)
e2_d = expected_calibration_error_l2_debiased(y, s_overconfident, n_bins=10)
print(f"L1: {e1:.4f} (plug-in)  {e1_d:.4f} (debiased)")
print(f"L2: {e2:.4f} (plug-in)  {e2_d:.4f} (debiased)")
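For intuition about what the plug-in variants compute, the sketch below implements a binned ECE in plain NumPy with equal-width bins. It is not the toolkit's implementation: the name `plugin_ece` is hypothetical, and the choice to report the L2 variant as a weighted mean of squared gaps (without a square root) is an assumption.

```python
import numpy as np

def plugin_ece(y_true, scores, n_bins=10, norm="l1"):
    """Hypothetical plug-in ECE: weighted |gap| (l1) or gap**2 (l2) over bins."""
    y_true = np.asarray(y_true, dtype=float)
    scores = np.asarray(scores, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.searchsorted(edges[1:-1], scores, side="right")
    total = 0.0
    for k in range(n_bins):
        mask = idx == k
        if not mask.any():
            continue  # empty bins contribute nothing
        # Gap between mean predicted probability and observed positive rate.
        gap = abs(scores[mask].mean() - y_true[mask].mean())
        weight = mask.mean()  # fraction of rows in this bin
        total += weight * (gap if norm == "l1" else gap ** 2)
    return float(total)

# Tiny hand-checkable case: two bins, each off by 0.1.
toy_y = np.array([0, 0, 1, 1])
toy_s = np.array([0.1, 0.1, 0.9, 0.9])
ece_l1 = plugin_ece(toy_y, toy_s, n_bins=2, norm="l1")
ece_l2 = plugin_ece(toy_y, toy_s, n_bins=2, norm="l2")
print(ece_l1, ece_l2)
```

Under this squared-gap convention the L2 value can never exceed the L1 value, because every per-bin gap lies in [0, 1].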

Which to use.

L2-debiased is the toolkit default for reporting: it preserves model rank ordering across bin counts (Naeini et al., 2015), and the debiasing correction removes the small-sample inflation that Kumar et al. (2019) document. Pitfall: under L1 plug-in, the ranking of two models can flip when the bin count changes.
L1 plug-in is the conventional binned ECE behind many published results. Use it for comparison with prior work, not for decision-making.
Equal-mass (quantile) ECE is more robust to imbalance than equal-width:

e_eqmass = expected_calibration_error_equal_mass(y, s_overconfident, n_bins=10)
print(f"L1 equal-mass: {e_eqmass:.4f}")
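The debiasing idea can be illustrated with a short sketch. It subtracts a per-bin sampling-variance term from the squared gap, in the spirit of the estimator in Kumar et al. (2019). Whether eval_toolkit uses exactly this form is an assumption, and `l2_ece_squared` is a hypothetical name.

```python
import numpy as np

def l2_ece_squared(y_true, scores, n_bins=10, debias=False):
    """Hypothetical squared-L2 ECE with an optional per-bin bias correction."""
    y_true = np.asarray(y_true, dtype=float)
    scores = np.asarray(scores, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.searchsorted(edges[1:-1], scores, side="right")
    n = len(scores)
    total = 0.0
    for k in range(n_bins):
        mask = idx == k
        n_k = int(mask.sum())
        if n_k == 0:
            continue
        rate = y_true[mask].mean()
        gap_sq = (scores[mask].mean() - rate) ** 2
        if debias and n_k > 1:
            # Sampling variance of the observed rate inflates gap_sq
            # even for a calibrated model; subtract an estimate of it.
            gap_sq -= rate * (1.0 - rate) / (n_k - 1)
        total += (n_k / n) * gap_sq
    return float(total)

rng = np.random.default_rng(0)
# Perfectly calibrated toy model: labels are drawn from the scores themselves.
calibrated_p = rng.uniform(0.0, 1.0, size=500)
calibrated_y = rng.binomial(1, calibrated_p)

plug_in = l2_ece_squared(calibrated_y, calibrated_p, debias=False)
debiased = l2_ece_squared(calibrated_y, calibrated_p, debias=True)
print(f"plug-in={plug_in:.5f}  debiased={debiased:.5f}")
```

On calibrated data the plug-in value is strictly positive purely from sampling noise, while the debiased value sits closer to zero and can be slightly negative.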

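What NOT to do. Don’t compare ECE values computed with different bin counts, whether across models or across papers. ECE is a binned estimator: too few bins average away miscalibration inside each bin and understate it, while too many bins leave few rows per bin, and sampling noise then inflates it. Which effect dominates depends on sample size. Pin n_bins per project and document it.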
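The bin-count sensitivity is easy to reproduce. The sketch below, which assumes a perfectly calibrated toy model and a hypothetical `l1_ece` helper (not the toolkit function), computes plug-in L1 ECE on the same 500 rows with several bin counts.

```python
import numpy as np

def l1_ece(y_true, scores, n_bins):
    """Hypothetical equal-width plug-in L1 ECE, vectorised with bincount."""
    y_true = np.asarray(y_true, dtype=float)
    scores = np.asarray(scores, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.searchsorted(edges[1:-1], scores, side="right")
    counts = np.bincount(idx, minlength=n_bins)
    sum_s = np.bincount(idx, weights=scores, minlength=n_bins)
    sum_y = np.bincount(idx, weights=y_true, minlength=n_bins)
    keep = counts > 0  # skip empty bins
    gaps = np.abs(sum_s[keep] / counts[keep] - sum_y[keep] / counts[keep])
    return float(np.sum(counts[keep] / len(scores) * gaps))

rng = np.random.default_rng(0)
# Labels drawn from the scores, so the true calibration error is zero.
p = rng.uniform(0.0, 1.0, size=500)
labels = rng.binomial(1, p)

ece_by_bins = {b: l1_ece(labels, p, b) for b in (5, 10, 50, 100)}
for b, value in ece_by_bins.items():
    print(f"n_bins={b:>3}  ECE={value:.4f}")
```

The model is identical in every row of the output; only the estimator changes. The reported ECE grows with the bin count because each bin holds fewer rows and its observed rate gets noisier.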

Brier score & decomposition#

The Brier score (mean squared probability error) is the proper-scoring-rule analogue of ECE — it’s threshold-free, sensitive to calibration AND ranking, and decomposes additively into three interpretable components (Murphy, 1973).

\[\text{BS} = \text{REL} - \text{RES} + \text{UNC}\]

REL (reliability, lower better): squared distance between predicted probability and empirical positive rate per bin.
RES (resolution, higher better): variance of bin rates around the marginal — how much the model separates outcomes.
UNC (uncertainty, irreducible): the marginal Bernoulli variance \(\bar y (1-\bar y)\).

bs = brier_score(y, s_overconfident)
parts = brier_decomposition(y, s_overconfident, n_bins=10)
print(f"Brier: {bs:.4f}")
print(f"  reliability={parts['reliability']:.4f}  "
      f"resolution={parts['resolution']:.4f}  "
      f"uncertainty={parts['uncertainty']:.4f}")
print(f"  identity check: REL - RES + UNC = "
      f"{parts['reliability'] - parts['resolution'] + parts['uncertainty']:.4f}")

A model can have low Brier from either good calibration (low REL) or strong separation (high RES). The decomposition makes this trade-off visible — two models with the same Brier may have very different operational profiles.
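As a reference point, the three terms can be computed by hand. The sketch below is a plain NumPy version of the binned Murphy decomposition; `murphy_decomposition` is a hypothetical name and the toolkit's binning may differ. One detail worth knowing: the identity REL - RES + UNC holds exactly for the Brier score of the binned forecasts (each score replaced by its bin mean), and only approximately for the raw scores.

```python
import numpy as np

def murphy_decomposition(y_true, scores, n_bins=10):
    """Hypothetical binned Murphy (1973) decomposition of the Brier score."""
    y = np.asarray(y_true, dtype=float)
    s = np.asarray(scores, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.searchsorted(edges[1:-1], s, side="right")
    base_rate = y.mean()
    rel = res = 0.0
    binned = np.empty_like(s)
    for k in range(n_bins):
        mask = idx == k
        if not mask.any():
            continue
        w = mask.mean()          # fraction of rows in the bin
        f_k = s[mask].mean()     # mean forecast in the bin
        o_k = y[mask].mean()     # observed positive rate in the bin
        rel += w * (f_k - o_k) ** 2
        res += w * (o_k - base_rate) ** 2
        binned[mask] = f_k       # forecast after snapping to the bin mean
    return {
        "reliability": rel,
        "resolution": res,
        "uncertainty": base_rate * (1.0 - base_rate),
        "brier_binned": float(np.mean((binned - y) ** 2)),
        "brier_raw": float(np.mean((s - y) ** 2)),
    }

rng = np.random.default_rng(0)
toy_scores = rng.uniform(0.0, 1.0, size=1000)
toy_labels = rng.binomial(1, toy_scores ** 2)  # deliberately miscalibrated
d = murphy_decomposition(toy_labels, toy_scores)
identity = d["reliability"] - d["resolution"] + d["uncertainty"]
print(f"REL - RES + UNC = {identity:.6f}  binned Brier = {d['brier_binned']:.6f}")
```

If an identity check against the raw Brier score is off in the third or fourth decimal, within-bin spread of the scores is the usual explanation rather than a bug.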

Recalibration#

When ECE / REL is high, the model can often be recalibrated post-hoc: fit a 1-D monotone function on validation data that maps raw scores to calibrated probabilities. Rank order is preserved, although isotonic regression can introduce ties.

Temperature scaling (Guo et al. 2017)#

Single-parameter: divide logits by a scalar T > 0 before softmax. Preserves accuracy exactly, because dividing every logit by the same positive constant leaves the argmax unchanged. Simplest and most common for transformer outputs.

from eval_toolkit import fit_temperature

# fit_temperature wants logits as shape (n, 2): col 0 = neg, col 1 = pos.
val_logits = np.column_stack([1 - linear, linear])
val_labels = y
result = fit_temperature(val_logits, val_labels)
print(f"T*={result['temperature']:.3f}  NLL: {result['nll_pre']:.3f} -> {result['nll_post']:.3f}")
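What a temperature fit does under the hood can be sketched in a few lines: pick the T that minimises validation NLL. The version below uses a simple log-spaced grid search in NumPy. The names `nll_at_temperature` and `fit_temperature_grid` are hypothetical, and the toolkit's `fit_temperature` may well use a different optimiser.

```python
import numpy as np

def nll_at_temperature(logits, labels, temperature):
    """Mean negative log-likelihood of softmax(logits / T)."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)  # stabilise the log-sum-exp
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def fit_temperature_grid(logits, labels, grid=None):
    """Hypothetical fit: return the grid temperature with the lowest NLL."""
    if grid is None:
        grid = np.exp(np.linspace(np.log(0.05), np.log(20.0), 400))
    losses = np.array([nll_at_temperature(logits, labels, t) for t in grid])
    return float(grid[losses.argmin()])

rng = np.random.default_rng(0)
true_logit = rng.normal(0.0, 1.5, size=5000)
toy_labels = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_logit)))
# Overconfident toy model: logits are 3x too large, so the ideal T is about 3.
toy_logits = np.column_stack([np.zeros_like(true_logit), 3.0 * true_logit])

t_hat = fit_temperature_grid(toy_logits, toy_labels)
nll_pre = nll_at_temperature(toy_logits, toy_labels, 1.0)
nll_post = nll_at_temperature(toy_logits, toy_labels, t_hat)
print(f"T*={t_hat:.2f}  NLL: {nll_pre:.3f} -> {nll_post:.3f}")
```

A fitted T above 1 softens overconfident probabilities; a T below 1 sharpens underconfident ones.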

Isotonic regression#

Non-parametric monotone fit. More flexible than temperature, requires more validation data (sklearn rule of thumb: ≥ 1000 rows). The fitted map is a step function, so it can collapse distinct scores into ties and the calibrated score distribution is no longer smooth.

from eval_toolkit import fit_isotonic_calibrator

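# Fit and scored on the same rows to keep the example short, so the "after"
# ECE is optimistic. In practice, fit on validation and report on held-out data.
apply = fit_isotonic_calibrator(y, s_overconfident)
s_calibrated = apply(s_overconfident)
print(f"ECE before: {expected_calibration_error_l2_debiased(y, s_overconfident, n_bins=10):.4f}")
print(f"ECE after:  {expected_calibration_error_l2_debiased(y, s_calibrated, n_bins=10):.4f}")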
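For readers curious about the fit itself, isotonic regression is usually solved with the pool-adjacent-violators (PAV) algorithm. The sketch below is a minimal NumPy version, not the toolkit's code: `pav_fit` and `fit_isotonic_sketch` are hypothetical names, and it assumes the training scores are distinct (tied scores would need to be pooled first).

```python
import numpy as np

def pav_fit(values):
    """Pool adjacent violators: best non-decreasing fit in least squares."""
    sums, counts = [], []
    for v in values:
        sums.append(float(v))
        counts.append(1)
        # Merge blocks while the previous block's mean exceeds the last one's.
        while len(sums) > 1 and sums[-2] / counts[-2] > sums[-1] / counts[-1]:
            s, c = sums.pop(), counts.pop()
            sums[-1] += s
            counts[-1] += c
    means = np.array(sums) / np.array(counts)
    return np.repeat(means, counts)

def fit_isotonic_sketch(y_true, scores):
    """Hypothetical calibrator: returns a function mapping scores to rates."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores, kind="stable")
    x_sorted = scores[order]
    fitted = pav_fit(np.asarray(y_true, dtype=float)[order])

    def apply(new_scores):
        # Linear interpolation between fitted points; clamps outside the range.
        return np.interp(new_scores, x_sorted, fitted)

    return apply

rng = np.random.default_rng(0)
toy_scores = rng.uniform(0.0, 1.0, size=2000)
toy_labels = rng.binomial(1, toy_scores ** 2)  # true rate is score squared

toy_apply = fit_isotonic_sketch(toy_labels, toy_scores)
grid = np.linspace(0.0, 1.0, 101)
calibrated_grid = toy_apply(grid)
print(np.round(calibrated_grid[::20], 3))
```

The output is non-decreasing by construction, and the mean of the fitted values on the training rows equals the observed positive rate.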

Platt scaling#

Sigmoid fit (logistic regression with a single scaled+shifted feature). Two parameters; more flexible than temperature, less than isotonic.

from eval_toolkit import fit_platt_calibrator

# As above: fit and scored on the same rows for brevity, so this ECE is optimistic.
apply = fit_platt_calibrator(y, s_overconfident)
s_calibrated = apply(s_overconfident)
print(f"ECE after Platt: {expected_calibration_error_l2_debiased(y, s_calibrated, n_bins=10):.4f}")
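The two Platt parameters are just the slope and intercept of a one-feature logistic regression. Here is a minimal NumPy sketch that fits them with damped Newton steps. `fit_platt_sketch` is a hypothetical name, the toy scores are raw logit-like values rather than probabilities, and the label smoothing from Platt's original recipe is omitted, so this is an illustration and not the toolkit's `fit_platt_calibrator`.

```python
import numpy as np

def fit_platt_sketch(y_true, raw_scores, n_iter=50):
    """Hypothetical fit of p = sigmoid(a * score + b) by damped Newton steps."""
    y = np.asarray(y_true, dtype=float)
    X = np.column_stack([np.asarray(raw_scores, dtype=float), np.ones(len(y))])
    w = np.zeros(2)

    def nll(weights):
        z = X @ weights
        # log(1 + exp(z)) - y * z is the per-row logistic loss.
        return float(np.mean(np.logaddexp(0.0, z) - y * z))

    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        grad = X.T @ (p - y) / len(y)
        hess = (X * (p * (1.0 - p))[:, None]).T @ X / len(y) + 1e-10 * np.eye(2)
        step = np.linalg.solve(hess, grad)
        t = 1.0
        # Backtrack so the loss never increases.
        while nll(w - t * step) > nll(w) and t > 1e-8:
            t *= 0.5
        w = w - t * step
    return float(w[0]), float(w[1])

rng = np.random.default_rng(0)
true_logit = rng.normal(0.0, 1.5, size=5000)
toy_labels = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_logit)))
# Toy model output: scaled and shifted, so the ideal map is a=1/3, b=-1/3.
raw_scores = 3.0 * true_logit + 1.0

platt_a, platt_b = fit_platt_sketch(toy_labels, raw_scores)
print(f"a={platt_a:.3f}  b={platt_b:.3f}")
```

Compared with temperature scaling, the extra intercept lets Platt scaling correct a systematic shift as well as a wrong scale.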

When NOT to recalibrate#

You have <500 validation rows. Recalibrators overfit small samples; the post-cal ECE on a fresh test set can be worse than uncalibrated.
The score distribution is bimodal at 0/1. Many production classifiers output near-deterministic predictions. Recalibration can’t add information; it just smooths the distribution. ECE will improve mechanically but the decision rule is unchanged.
You’re comparing two raw models. ECE comparison must be on the raw outputs unless you explicitly note “after recalibration on shared val set”. Recalibrating one but not the other is unfair.
Production won’t apply the calibrator. If the deployment ships raw model outputs, your calibrated ECE is fiction.

PyTorch & transformer specifics#

Logit-domain calibration#

Temperature scaling is computed on logits, not probabilities. With HuggingFace transformers, that means model(...).logits, taken before softmax. Calibrating after softmax loses information: once a probability has saturated to 0 or 1 in floating point, the logit behind it can no longer be recovered.

# Sketch — requires torch / transformers, marked skip for Sybil.
import torch  # noqa
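# with torch.no_grad():
#     logits = model(input_ids).logits  # shape (batch, 2) for binary
# # NumPy has no bfloat16 dtype; upcasting to float32 is exact, so values are unchanged.
# T = fit_temperature(logits.float().cpu().numpy(), labels.cpu().numpy())["temperature"]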
# probs = torch.softmax(logits / T, dim=-1)
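The saturation problem can be shown without torch. In the NumPy sketch below (float32, illustrative values only), a temperature of 4 is applied once to the logits and once to logits reconstructed from the post-sigmoid probabilities.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Positive-minus-negative logit for three rows, stored in float32.
logit_gap = np.array([5.0, 20.0, 40.0], dtype=np.float32)
T = np.float32(4.0)

# Route 1: scale in the logit domain, then squash.
p_from_logits = sigmoid(logit_gap / T)

# Route 2: squash first, then try to invert and scale.
p_raw = sigmoid(logit_gap)  # the last two entries round to exactly 1.0
with np.errstate(divide="ignore"):
    recovered = np.log(p_raw) - np.log1p(-p_raw)  # inf where p_raw == 1.0
p_from_probs = sigmoid(recovered / T)

print("logit domain:", p_from_logits)
print("prob domain :", p_from_probs)
```

The first row survives the round trip, but the two saturated rows come back as exactly 1.0 in the probability domain, while the logit-domain route still tells them apart.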

fp16 / bf16 numerics#

Mixed-precision inference (fp16, bf16) introduces small numerical noise in logits, which softmax amplifies in the tail. Empirically: ECE under bf16 is ~0.001–0.005 higher than fp32 on the same model, depending on batch size. Two implications:

Calibrate at inference precision. If production uses bf16, fit temperature on bf16 logits — not on fp32 logits cast back.
Don’t compare ECE across precision levels. An ECE delta of the size quoted above (roughly 0.001–0.005) is within fp16/bf16 noise for moderate-size eval sets.
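A rough feel for the size of the precision effect can be had in NumPy, which has no bfloat16 dtype. The sketch below emulates bf16 by zeroing the low 16 bits of float32 values. That is truncation rather than the round-to-nearest that real hardware typically uses, so treat it as an approximation. `truncate_to_bf16` is a hypothetical helper and the logits are synthetic.

```python
import numpy as np

def truncate_to_bf16(x):
    """Hypothetical bf16 emulation: keep sign, exponent and 7 mantissa bits."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
logits_fp32 = rng.normal(0.0, 3.0, size=10_000).astype(np.float32)
logits_bf16 = truncate_to_bf16(logits_fp32)

p_fp32 = sigmoid(logits_fp32)
p_bf16 = sigmoid(logits_bf16)
max_prob_diff = float(np.abs(p_fp32 - p_bf16).max())
print(f"max |p_fp32 - p_bf16| = {max_prob_diff:.5f}")
```

This only perturbs the final logits. In a real model the error also accumulates through every layer, which is one reason to measure on actual bf16 inference outputs rather than rely on a simulation like this.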

Calibration drift across checkpoints#

A transformer’s calibration changes during fine-tuning even when its accuracy plateaus. Fit temperature on the same checkpoint you’ll deploy; don’t reuse a temperature from a previous epoch. See reproducibility.md for why the checkpoint hash should land in the manifest’s code_versions.

Pitfalls / Common mistakes#

Reporting ECE on raw logits. ECE is only meaningful when scores are in [0, 1] and interpretable as P(y=1 | x). The toolkit’s expected_calibration_error* functions raise ValueError if scores fall outside [0, 1], so apply softmax / sigmoid first.
Picking n_bins arbitrarily. Both equal-width and equal-mass ECE are sensitive to bin count. The toolkit defaults to 10; document and pin whatever you choose. Cross-paper comparisons require the same n_bins.
Comparing L1 and L2 ECE numerically. Both are bounded by 1, but they aggregate the per-bin gaps differently (L2 is typically the smaller number), so the values aren’t interchangeable. Pick one, document the choice.
Recalibrating on the test set. Use a validation slice carved off the train fold. Recalibrating on test is a direct leakage of the metric you’re about to report.
Ignoring single-class slices. ECE is degenerate when y is all-positive or all-negative. The toolkit’s reliability_curve flags these with a "skipped" marker.
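For the first pitfall, converting logits to P(y=1 | x) before calling any ECE function takes only a few lines. The helper below is a hypothetical NumPy sketch that accepts either a single logit per row or two-column (neg, pos) logits, and uses a numerically stable sigmoid.

```python
import numpy as np

def positive_class_probability(logits):
    """Hypothetical helper: map logits of shape (n,) or (n, 2) to P(y=1)."""
    logits = np.asarray(logits, dtype=float)
    if logits.ndim == 2 and logits.shape[1] == 2:
        # Two-class softmax reduces to a sigmoid of the logit difference.
        z = logits[:, 1] - logits[:, 0]
    elif logits.ndim == 1:
        z = logits
    else:
        raise ValueError("expected logits of shape (n,) or (n, 2)")
    # Stable sigmoid: only ever exponentiate a non-positive number.
    e = np.exp(-np.abs(z))
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

two_col = np.array([[0.0, 0.0], [-1000.0, 1000.0], [2.0, -1.0]])
one_col = np.array([0.0, -2000.0, 3.0])
p_two = positive_class_probability(two_col)
p_one = positive_class_probability(one_col)
print(p_two, p_one)
```

The resulting arrays are guaranteed to lie in [0, 1], so they can be passed to a score-based metric without tripping a range check.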

Putting it all together#

# Full calibration audit on a single slice.
bs = brier_score(y, s_overconfident)
ece = expected_calibration_error_l2_debiased(y, s_overconfident, n_bins=10)
parts = brier_decomposition(y, s_overconfident, n_bins=10)
curve = reliability_curve(y, s_overconfident, n_bins=10, strategy="quantile")

print(f"Brier: {bs:.4f}")
print(f"  reliability={parts['reliability']:.4f}  resolution={parts['resolution']:.4f}")
print(f"L2-debiased ECE: {ece:.4f}")
print(f"Equal-mass L1 ECE: {curve['ece_equal_mass']:.4f}")
print(f"  (quantile bins; n_bins={curve['n_bins']})")