Calibration#

Background (skip if you’ve internalized this). A classifier is calibrated if its predicted probabilities match observed frequencies — among rows where the model says “70 % positive”, roughly 70 % are positive. Modern neural networks are routinely miscalibrated: confidence systematically diverges from accuracy (Guo et al., 2017). Miscalibrated probabilities aren’t just an aesthetic problem — they break cost-sensitive thresholding (thresholds.md), Bayes-optimal decision rules, and any downstream system that treats the score as a probability (selective prediction, ensembling, abstain-cost analysis).

This chapter covers how to measure calibration (ECE variants, Brier decomposition, reliability diagrams), how to fix it (temperature, isotonic, Platt), and when not to.

Setup#

import numpy as np
from eval_toolkit import (
    expected_calibration_error,
    expected_calibration_error_l2,
    expected_calibration_error_debiased,
    expected_calibration_error_l2_debiased,
    expected_calibration_error_equal_mass,
    brier_score, brier_decomposition,
    reliability_curve,
)

A 500-row miscalibrated fixture (overconfident scores: shifted away from 0.5):

rng = np.random.default_rng(42)
y = rng.binomial(1, 0.4, size=500).astype(int)
# Model output: correct on the rank, but overconfident.
linear = 0.7 * y + 0.3 * rng.normal(0, 0.5, size=500)
s_overconfident = np.clip(0.5 + np.tanh(linear * 2.5) * 0.45, 0, 1)

Reliability diagram#

The canonical visual: bin the predictions, then plot mean predicted probability against observed positive rate per bin. Points on the diagonal indicate perfect calibration.

curve = reliability_curve(y, s_overconfident, n_bins=10, strategy="quantile")
print(f"n_bins={curve['n_bins']}  ece_equal_mass={curve['ece_equal_mass']:.3f}")
# `prob_true` / `prob_pred` arrays plot as the reliability diagram;
# eval_toolkit.plotting.plot_reliability_diagram(...) renders it.

strategy="quantile" (equal-mass binning) is preferred over "uniform" (equal-width) under class imbalance — equal-width concentrates most mass in 1–2 bins and the calibration signal collapses.
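The difference between the two strategies is easy to see on skewed scores. The sketch below is a minimal NumPy illustration, not the toolkit's implementation: `reliability_bins` is a hypothetical helper, and the Beta-distributed scores are synthetic data chosen to mimic class imbalance.

```python
import numpy as np

def reliability_bins(y_true, y_prob, n_bins=10, strategy="quantile"):
    """Hypothetical helper: per-bin mean prediction, positive rate, and count."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    if strategy == "quantile":
        # Equal-mass: edges sit at empirical quantiles of the scores.
        edges = np.quantile(y_prob, np.linspace(0.0, 1.0, n_bins + 1))
    elif strategy == "uniform":
        # Equal-width: edges are evenly spaced on [0, 1].
        edges = np.linspace(0.0, 1.0, n_bins + 1)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    # Assign each score to a bin using the interior edges only.
    idx = np.searchsorted(edges[1:-1], y_prob, side="right")
    counts = np.bincount(idx, minlength=n_bins)
    sum_pred = np.bincount(idx, weights=y_prob, minlength=n_bins)
    sum_true = np.bincount(idx, weights=y_true, minlength=n_bins)
    keep = counts > 0  # drop empty bins
    return sum_pred[keep] / counts[keep], sum_true[keep] / counts[keep], counts[keep]

rng = np.random.default_rng(0)
p = rng.beta(0.5, 8.0, size=2000)   # synthetic scores, most of them near 0
y = rng.binomial(1, p)              # labels drawn from those probabilities

counts_by_strategy = {}
for strategy in ("uniform", "quantile"):
    _, _, counts = reliability_bins(y, p, n_bins=10, strategy=strategy)
    counts_by_strategy[strategy] = counts
    print(strategy, counts.tolist())
```

With equal-width bins the first bin holds most of the rows and the upper bins are nearly empty, while quantile bins each hold about the same number of rows.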

ECE variants#

Expected Calibration Error: a single-number summary of the reliability diagram. Four variants ship with eval-toolkit, differing on (a) L1 vs L2 norm and (b) plug-in vs debiased.

| Variant | Function | Norm | Debiased |
| --- | --- | --- | --- |
| L1 plug-in | expected_calibration_error | L1 | no |
| L1 debiased | expected_calibration_error_debiased | L1 | yes |
| L2 plug-in | expected_calibration_error_l2 | L2 | no |
| L2 debiased | expected_calibration_error_l2_debiased | L2 | yes |

e1   = expected_calibration_error(y, s_overconfident, n_bins=10)
e1_d = expected_calibration_error_debiased(y, s_overconfident, n_bins=10)
e2   = expected_calibration_error_l2(y, s_overconfident, n_bins=10)
e2_d = expected_calibration_error_l2_debiased(y, s_overconfident, n_bins=10)
print(f"L1: {e1:.4f} (plug-in)  {e1_d:.4f} (debiased)")
print(f"L2: {e2:.4f} (plug-in)  {e2_d:.4f} (debiased)")
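For intuition about what these numbers measure, here is a minimal sketch of a binned calibration error under stated assumptions. `binned_calibration_error` is a hypothetical helper, not the toolkit's code: it uses equal-width bins, returns the squared error for the L2 norm (no square root), and applies a per-bin variance correction in the spirit of Kumar et al. (2019) for the L2 case only. The toolkit's estimators may differ in each of these choices.

```python
import numpy as np

def binned_calibration_error(y_true, y_prob, n_bins=10, norm="l1", debias=False):
    """Hypothetical equal-width binned calibration error (illustration only)."""
    if norm not in ("l1", "l2"):
        raise ValueError("norm must be 'l1' or 'l2'")
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.searchsorted(edges[1:-1], y_prob, side="right")
    total = 0.0
    for b in range(n_bins):
        mask = idx == b
        n_b = int(mask.sum())
        if n_b == 0:
            continue
        conf = y_prob[mask].mean()   # mean predicted probability in the bin
        rate = y_true[mask].mean()   # observed positive rate in the bin
        gap = abs(conf - rate)
        if norm == "l1":
            term = gap
        else:
            term = gap ** 2
            if debias and n_b > 1:
                # Subtract the estimated sampling variance of the bin rate,
                # which is what inflates the plug-in value on small bins.
                term -= rate * (1.0 - rate) / (n_b - 1)
        total += (n_b / len(y_prob)) * term
    return total

rng = np.random.default_rng(0)
p = rng.uniform(0.0, 1.0, size=2000)
y = rng.binomial(1, p)   # calibrated by construction: P(y=1) equals p

ece_l1 = binned_calibration_error(y, p, norm="l1")
ece_l2 = binned_calibration_error(y, p, norm="l2")
ece_l2_deb = binned_calibration_error(y, p, norm="l2", debias=True)
print(f"L1 plug-in: {ece_l1:.4f}")
print(f"L2 plug-in: {ece_l2:.5f}   L2 debiased: {ece_l2_deb:.5f}")
```

On data that is calibrated by construction the plug-in values stay above zero purely from sampling noise, whereas the debiased value sits closer to zero and can be slightly negative.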

Which to use.

  • L2-debiased is the toolkit default for reporting: it preserves rank ordering across bin counts (Naeini et al., 2015), and the debiasing correction removes the small-sample inflation that Kumar et al. (2019) document. Pitfall: under L1 plug-in, the ranking of two models can flip when the bin count changes.

  • L1 plug-in is the conventional binned ECE behind many published results (e.g. Guo et al., 2017). Use it for comparison with prior work, not for decision-making.

  • Equal-mass (quantile) ECE is more robust to imbalance than equal-width:

e_eqmass = expected_calibration_error_equal_mass(y, s_overconfident, n_bins=10)
print(f"L1 equal-mass: {e_eqmass:.4f}")

What NOT to do. Don’t compare ECE across two models computed with different bin counts. ECE is a binned estimator: too few bins understate miscalibration because wide bins average it away, too many bins overstate it because per-bin sampling noise inflates the estimate, and the size of both effects depends on sample size. Pin n_bins per project and document it.
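The bin-count effect can be reproduced on synthetic data. The sketch below uses a hypothetical `ece_l1` helper (plug-in L1, equal-width bins) on scores that are calibrated by construction, so any non-zero value is estimator noise and not real miscalibration.

```python
import numpy as np

def ece_l1(y_true, y_prob, n_bins):
    """Plug-in L1 ECE with equal-width bins (illustrative helper)."""
    idx = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)
    counts = np.bincount(idx, minlength=n_bins)
    sum_pred = np.bincount(idx, weights=y_prob, minlength=n_bins)
    sum_true = np.bincount(idx, weights=y_true, minlength=n_bins)
    keep = counts > 0
    gaps = np.abs(sum_pred[keep] - sum_true[keep]) / counts[keep]
    # Weight each bin's gap by the share of rows it holds.
    return float(np.sum(counts[keep] / len(y_prob) * gaps))

rng = np.random.default_rng(0)
p = rng.uniform(0.0, 1.0, size=500)
y = rng.binomial(1, p)   # labels drawn from p, so p is calibrated

ece_by_bins = {b: ece_l1(y, p, b) for b in (5, 10, 50, 100)}
for b, e in ece_by_bins.items():
    print(f"n_bins={b:>3}  ECE={e:.4f}")
```

The reported ECE grows with the bin count even though the scores never change, which is why two numbers computed with different n_bins are not comparable.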

Brier score & decomposition#

The Brier score (mean squared probability error) is the proper-scoring-rule analogue of ECE — it’s threshold-free, sensitive to calibration AND ranking, and decomposes additively into three interpretable components (Murphy, 1973).

\[\text{BS} = \text{REL} - \text{RES} + \text{UNC}\]

  • REL (reliability, lower better): bin-size-weighted mean of the squared distance between mean predicted probability and empirical positive rate in each bin.

  • RES (resolution, higher better): bin-size-weighted variance of bin positive rates around the marginal rate, i.e. how much the model separates outcomes.

  • UNC (uncertainty, irreducible): the marginal Bernoulli variance \(\bar y (1-\bar y)\).

bs = brier_score(y, s_overconfident)
parts = brier_decomposition(y, s_overconfident, n_bins=10)
print(f"Brier: {bs:.4f}")
print(f"  reliability={parts['reliability']:.4f}  "
      f"resolution={parts['resolution']:.4f}  "
      f"uncertainty={parts['uncertainty']:.4f}")
print(f"  identity check: REL - RES + UNC = "
      f"{parts['reliability'] - parts['resolution'] + parts['uncertainty']:.4f}")

A model can have low Brier from either good calibration (low REL) or strong separation (high RES). The decomposition makes this trade-off visible — two models with the same Brier may have very different operational profiles.
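A minimal sketch of the three-term partition, assuming equal-width bins. `murphy_decomposition` is a hypothetical helper and not the toolkit's brier_decomposition. One detail worth knowing: the identity holds exactly for the Brier score of the binned forecast (each score replaced by its bin mean); the Brier score of the raw scores can differ slightly because scores vary within a bin.

```python
import numpy as np

def murphy_decomposition(y_true, y_prob, n_bins=10):
    """Hypothetical binned Brier decomposition (illustration only)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    n = len(y_true)
    base_rate = y_true.mean()
    idx = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)
    binned = np.empty_like(y_prob)
    rel = 0.0
    res = 0.0
    for b in np.unique(idx):
        mask = idx == b
        weight = mask.sum() / n
        conf = y_prob[mask].mean()   # mean forecast in the bin
        rate = y_true[mask].mean()   # observed positive rate in the bin
        rel += weight * (conf - rate) ** 2
        res += weight * (rate - base_rate) ** 2
        binned[mask] = conf          # forecast after collapsing to bin means
    return {
        "reliability": rel,
        "resolution": res,
        "uncertainty": base_rate * (1.0 - base_rate),
        "brier_binned": float(np.mean((binned - y_true) ** 2)),
        "brier_raw": float(np.mean((y_prob - y_true) ** 2)),
    }

rng = np.random.default_rng(0)
p_true = rng.uniform(0.02, 0.98, size=2000)
y = rng.binomial(1, p_true)
# Synthetic overconfident scores: the true log-odds multiplied by 3.
s = 1.0 / (1.0 + np.exp(-3.0 * np.log(p_true / (1.0 - p_true))))

parts = murphy_decomposition(y, s, n_bins=10)
recombined = parts["reliability"] - parts["resolution"] + parts["uncertainty"]
print(f"REL - RES + UNC = {recombined:.4f}")
print(f"binned Brier    = {parts['brier_binned']:.4f}")
print(f"raw Brier       = {parts['brier_raw']:.4f}")
```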

Recalibration#

When ECE / REL is high, the model can often be recalibrated post-hoc: fit a 1-D monotone function on validation data that maps raw scores to calibrated probabilities. Because the map is monotone, the rank order is unchanged, up to ties.

Temperature scaling (Guo et al. 2017)#

Single-parameter: divide logits by a scalar T > 0 before softmax. This preserves accuracy exactly, because argmax is invariant to positive scaling. It is the simplest method and the most common choice for transformer outputs.

from eval_toolkit import fit_temperature

# fit_temperature wants logits as shape (n, 2): col 0 = neg, col 1 = pos.
val_logits = np.column_stack([1 - linear, linear])
val_labels = y
result = fit_temperature(val_logits, val_labels)
print(f"T*={result['temperature']:.3f}  NLL: {result['nll_pre']:.3f} -> {result['nll_post']:.3f}")
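What a temperature fit does internally can be sketched with SciPy. This is an illustration under stated assumptions, not the toolkit's fit_temperature: `nll` and `fit_temperature_sketch` are hypothetical names, the search is a bounded 1-D minimisation over log T, and the synthetic logits are made overconfident by multiplying the true log-odds by 3.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import log_softmax

def nll(logits, labels, temperature):
    """Mean negative log-likelihood of the labels at a given temperature."""
    logp = log_softmax(logits / temperature, axis=1)
    return float(-logp[np.arange(len(labels)), labels].mean())

def fit_temperature_sketch(logits, labels):
    """Hypothetical fit: minimise NLL over log T so that T stays positive."""
    result = minimize_scalar(
        lambda log_t: nll(logits, labels, np.exp(log_t)),
        bounds=(-5.0, 5.0),
        method="bounded",
    )
    return float(np.exp(result.x))

rng = np.random.default_rng(0)
n = 2000
true_logit = rng.normal(0.0, 1.5, size=n)
labels = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_logit)))
# Column 0 = negative class, column 1 = positive class; margin is 3x too large.
logits = np.column_stack([np.zeros(n), 3.0 * true_logit])

T = fit_temperature_sketch(logits, labels)
nll_pre = nll(logits, labels, 1.0)
nll_post = nll(logits, labels, T)
print(f"T*={T:.3f}  NLL: {nll_pre:.3f} -> {nll_post:.3f}")
```

Because the logits were inflated by a factor of 3, the fitted temperature should land near 3, and the predicted class of every row is unchanged after scaling.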

Isotonic regression#

Non-parametric monotone fit. More flexible than temperature, but it requires more validation data (sklearn rule of thumb: ≥ 1000 rows). The fitted map has flat regions, so distinct raw scores can collapse to the same calibrated value and the score distribution becomes coarser.

from eval_toolkit import fit_isotonic_calibrator

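# In-sample fit for illustration only; in practice fit on validation rows
# and report ECE on a separate held-out set.
apply = fit_isotonic_calibrator(y, s_overconfident)
s_calibrated = apply(s_overconfident)
print(f"ECE before: {expected_calibration_error_l2_debiased(y, s_overconfident, n_bins=10):.4f}")
print(f"ECE after:  {expected_calibration_error_l2_debiased(y, s_calibrated, n_bins=10):.4f}")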
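For a version that keeps fitting and evaluation separate, here is a hedged sketch using scikit-learn's IsotonicRegression directly in place of the toolkit wrapper. The data, the 50/50 split, and the `brier` helper are assumptions made for the example.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def brier(y_true, y_prob):
    return float(np.mean((y_prob - y_true) ** 2))

rng = np.random.default_rng(0)
n = 4000
p_true = rng.uniform(0.02, 0.98, size=n)
y = rng.binomial(1, p_true)
# Synthetic overconfident scores: the true log-odds multiplied by 3.
s = 1.0 / (1.0 + np.exp(-3.0 * np.log(p_true / (1.0 - p_true))))

# First half fits the calibrator, second half is held out for evaluation.
s_cal, y_cal = s[: n // 2], y[: n // 2]
s_test, y_test = s[n // 2 :], y[n // 2 :]

iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(s_cal, y_cal)
s_test_cal = iso.predict(s_test)

brier_before = brier(y_test, s_test)
brier_after = brier(y_test, s_test_cal)
n_unique_before = len(np.unique(s_test))
n_unique_after = len(np.unique(s_test_cal))
print(f"held-out Brier: {brier_before:.4f} -> {brier_after:.4f}")
print(f"distinct scores: {n_unique_before} -> {n_unique_after}")
```

The drop in the number of distinct scores shows the ties that the flat regions of the isotonic map introduce.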

Platt scaling#

Sigmoid fit (logistic regression with a single scaled+shifted feature). Two parameters; more flexible than temperature, less than isotonic.

from eval_toolkit import fit_platt_calibrator

apply = fit_platt_calibrator(y, s_overconfident)
s_calibrated = apply(s_overconfident)
print(f"ECE after Platt: {expected_calibration_error_l2_debiased(y, s_calibrated, n_bins=10):.4f}")
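The two-parameter sigmoid fit can be sketched with scikit-learn's LogisticRegression. Assumptions in this example: the single feature is the log-odds of the raw score (clipped away from 0 and 1), regularisation is effectively switched off with a large C, and `to_logit` and `brier` are hypothetical helpers. The toolkit's fit_platt_calibrator may make different choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def to_logit(p, eps=1e-6):
    """Log-odds of a probability, clipped to avoid infinities."""
    p = np.clip(p, eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def brier(y_true, y_prob):
    return float(np.mean((y_prob - y_true) ** 2))

rng = np.random.default_rng(0)
n = 4000
p_true = rng.uniform(0.02, 0.98, size=n)
y = rng.binomial(1, p_true)
# Synthetic overconfident scores: the true log-odds multiplied by 3.
s = 1.0 / (1.0 + np.exp(-3.0 * np.log(p_true / (1.0 - p_true))))

s_cal, y_cal = s[: n // 2], y[: n // 2]
s_test, y_test = s[n // 2 :], y[n // 2 :]

platt = LogisticRegression(C=1e6)   # large C: close to an unregularised fit
platt.fit(to_logit(s_cal).reshape(-1, 1), y_cal)
s_test_cal = platt.predict_proba(to_logit(s_test).reshape(-1, 1))[:, 1]

slope = float(platt.coef_[0, 0])
intercept = float(platt.intercept_[0])
brier_before = brier(y_test, s_test)
brier_after = brier(y_test, s_test_cal)
print(f"slope={slope:.3f}  intercept={intercept:.3f}")
print(f"held-out Brier: {brier_before:.4f} -> {brier_after:.4f}")
```

With a log-odds feature and a zero intercept, this fit reduces to binary temperature scaling with T = 1 / slope, which is one way to see why Platt scaling is the more flexible of the two.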

When NOT to recalibrate#

  • You have <500 validation rows. Recalibrators overfit small samples; the post-cal ECE on a fresh test set can be worse than uncalibrated.

  • The score distribution is bimodal at 0/1. Many production classifiers output near-deterministic predictions. Recalibration can’t add information; it just smooths the distribution. ECE will improve mechanically but the decision rule is unchanged.

  • You’re comparing two raw models. ECE comparison must be on the raw outputs unless you explicitly note “after recalibration on shared val set”. Recalibrating one but not the other is unfair.

  • Production won’t apply the calibrator. If the deployment ships raw model outputs, your calibrated ECE is fiction.

PyTorch & transformer specifics#

Logit-domain calibration#

Temperature scaling is computed on logits, not probabilities. With HuggingFace transformers, these are model(...).logits, taken before softmax is applied. Calibrating after softmax loses information: once a probability has saturated to 0 or 1 in floating point, the underlying logit margin can no longer be recovered.

# Sketch — requires torch / transformers, marked skip for Sybil.
import torch  # noqa
# logits = model(input_ids).logits  # shape (batch, 2) for binary
# T = fit_temperature(logits.cpu().numpy(), labels.cpu().numpy())["temperature"]
# probs = torch.softmax(logits / T, dim=-1)
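The saturation argument can be shown without torch. The sketch below uses NumPy float16 as a stand-in for half-precision inference (an assumption; NumPy has no bf16 type) and a hand-written `softmax` helper. In float16, the smaller class probability underflows to exactly 0 once the logit margin exceeds roughly 17.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax that stays in the dtype of its input."""
    z = x - x.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Two rows with different margins (20 and 30), stored in half precision.
logits = np.array([[20.0, 0.0], [30.0, 0.0]], dtype=np.float16)

# Softmax first: both rows saturate to [1, 0] and become indistinguishable.
p_raw = softmax(logits)

# Temperature applied in the logit domain: the margins shrink before the
# exponential, so the two rows keep distinct probabilities.
T = np.float16(4.0)
p_scaled = softmax(logits / T)

print("softmax(logits):    ", p_raw.tolist())
print("softmax(logits / T):", p_scaled.tolist())
```

Once the first array has been produced, no post-hoc function of those probabilities can tell the two rows apart, which is the information loss described above.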

fp16 / bf16 numerics#

Mixed-precision inference (fp16, bf16) introduces small numerical noise in logits, which softmax amplifies in the tail. Empirically: ECE under bf16 is ~0.001–0.005 higher than fp32 on the same model, depending on batch size. Two implications:

  1. Calibrate at inference precision. If production uses bf16, fit temperature on bf16 logits — not on fp32 logits cast back.

  2. Don’t compare ECE across precision levels. An ECE delta of a few thousandths (up to about 0.005, per the range above) is within fp16/bf16 noise for moderate-size eval sets.

Calibration drift across checkpoints#

A transformer’s calibration changes during fine-tuning even when its accuracy plateaus. Fit temperature on the same checkpoint you’ll deploy; don’t reuse a temperature from a previous epoch. See reproducibility.md for why the checkpoint hash should land in the manifest’s code_versions.

Pitfalls / Common mistakes#

  • Reporting ECE on raw logits. ECE is only meaningful when scores are in [0, 1] and interpretable as P(y=1 | x). The toolkit’s expected_calibration_error* functions raise ValueError if scores fall outside [0, 1], so apply softmax / sigmoid first.

  • Picking n_bins arbitrarily. Both equal-width and equal-mass ECE are sensitive to bin count. The toolkit defaults to 10; document and pin whatever you choose. Cross-paper comparisons require the same n_bins.

  • Comparing L1 and L2 ECE numerically. They’re on different scales: both are bounded by 1, but L2 values are typically smaller. Pick one and document the choice.

  • Recalibrating on the test set. Use a validation slice carved off the train fold. Recalibrating on test is a direct leakage of the metric you’re about to report.

  • Ignoring single-class slices. ECE is degenerate when y is all-positive or all-negative. The toolkit’s reliability_curve flags these with a "skipped" marker.
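Two of the pitfalls above (raw logits, single-class slices) can be guarded against before any metric is computed. The sketch below is one possible pre-check using SciPy; `positive_class_prob` and `has_both_classes` are hypothetical helpers, not part of the toolkit.

```python
import numpy as np
from scipy.special import expit, softmax

def positive_class_prob(logits):
    """Hypothetical helper: map raw logits to P(y=1 | x)."""
    logits = np.asarray(logits, dtype=float)
    if logits.ndim == 1:
        # Single-logit (sigmoid) head.
        return expit(logits)
    if logits.ndim == 2 and logits.shape[1] == 2:
        # Two-logit head: column 0 = negative, column 1 = positive.
        return softmax(logits, axis=1)[:, 1]
    raise ValueError("expected logits of shape (n,) or (n, 2)")

def has_both_classes(y_true):
    """True when the slice contains at least one positive and one negative."""
    return np.unique(np.asarray(y_true)).size >= 2

two_col = np.array([[0.5, 2.0], [1.0, -1.0], [0.0, 0.0]])
p_from_two = positive_class_prob(two_col)
# A two-logit softmax equals a sigmoid of the logit difference.
p_from_one = positive_class_prob(two_col[:, 1] - two_col[:, 0])

print(p_from_two.round(4).tolist())
print(has_both_classes([1, 1, 1]), has_both_classes([0, 1, 1]))
```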

Putting it all together#

# Full calibration audit on a single slice.
bs = brier_score(y, s_overconfident)
ece = expected_calibration_error_l2_debiased(y, s_overconfident, n_bins=10)
parts = brier_decomposition(y, s_overconfident, n_bins=10)
curve = reliability_curve(y, s_overconfident, n_bins=10, strategy="quantile")

print(f"Brier: {bs:.4f}")
print(f"  reliability={parts['reliability']:.4f}  resolution={parts['resolution']:.4f}")
print(f"L2-debiased ECE: {ece:.4f}")
print(f"Equal-mass L1 ECE: {curve['ece_equal_mass']:.4f}")
print(f"  (quantile bins; n_bins={curve['n_bins']})")

Further reading#

  • Guo, C. et al. On Calibration of Modern Neural Networks. ICML 2017. arXiv:1706.04599 — temperature scaling, the canonical post-hoc method.

  • Naeini, M. P. et al. Obtaining Well Calibrated Probabilities Using Bayesian Binning. AAAI 2015. arXiv:1411.0760 — ECE definition and the equal-mass-binning rationale.

  • Kumar, A., Liang, P., & Ma, T. Verified Uncertainty Calibration. NeurIPS 2019. arXiv:1909.10155 — debiased ECE estimators.

  • Murphy, A. H. A new vector partition of the probability score. J. Appl. Meteorology 12, 1973 — the Brier decomposition.

  • Nixon, J. et al. Measuring Calibration in Deep Learning. CVPRW 2019.

  • sklearn: calibration_curve, CalibratedClassifierCV.

See also: thresholds.md (calibrate before CostSensitiveSelector), comparison.md (paired bootstrap on ECE differences).