# Calibration

> **Background** *(skip if you've internalized this)*. A classifier is
> *calibrated* if its predicted probabilities match observed
> frequencies: among rows where the model says "70 % positive", roughly
> 70 % are positive. Modern neural networks are routinely
> *miscalibrated*: confidence systematically diverges from accuracy
> (Guo et al., [2017](https://arxiv.org/abs/1706.04599)). Miscalibrated
> probabilities aren't just an aesthetic problem; they break
> cost-sensitive thresholding ([thresholds.md](thresholds.md)),
> Bayes-optimal decision rules, and any downstream system that treats
> the score as a probability (selective prediction, ensembling,
> abstain-cost analysis).

This chapter covers how to *measure* calibration (ECE variants, Brier
decomposition, reliability diagrams), how to *fix* it (temperature,
isotonic, Platt), and when *not* to.

## Setup

```python
import numpy as np
from eval_toolkit import (
    expected_calibration_error,
    expected_calibration_error_l2,
    expected_calibration_error_debiased,
    expected_calibration_error_l2_debiased,
    expected_calibration_error_equal_mass,
    brier_score,
    brier_decomposition,
    reliability_curve,
)
```

A 500-row miscalibrated fixture (overconfident scores: shifted away
from 0.5):

```python
rng = np.random.default_rng(42)
y = rng.binomial(1, 0.4, size=500).astype(int)
# Model output: correct on the rank, but overconfident.
linear = 0.7 * y + 0.3 * rng.normal(0, 0.5, size=500)
s_overconfident = np.clip(0.5 + np.tanh(linear * 2.5) * 0.45, 0, 1)
```

(reliability)=
## Reliability diagram

The canonical visual: bin the predictions, then plot mean predicted
probability against observed positive rate per bin. Diagonal = perfect
calibration.

```python
curve = reliability_curve(y, s_overconfident, n_bins=10, strategy="quantile")
print(f"n_bins={curve['n_bins']}  ece_equal_mass={curve['ece_equal_mass']:.3f}")
# `prob_true` / `prob_pred` arrays plot as the reliability diagram;
# eval_toolkit.plotting.plot_reliability_diagram(...)
# renders it.
```

`strategy="quantile"` (equal-mass binning) is preferred over
`"uniform"` (equal-width) under class imbalance: equal-width binning
concentrates most of the mass in 1–2 bins and the calibration signal
collapses.

(ece-variants)=
## ECE variants

Expected Calibration Error: a single-number summary of the reliability
diagram. Four variants ship with eval-toolkit, differing on (a) L1 vs
L2 norm and (b) plug-in vs debiased.

| Variant | Function | Norm | Debiased |
|---|---|---|---|
| L1 plug-in | `expected_calibration_error` | L1 | no |
| L1 debiased | `expected_calibration_error_debiased` | L1 | yes |
| L2 plug-in | `expected_calibration_error_l2` | L2 | no |
| L2 debiased | `expected_calibration_error_l2_debiased` | L2 | yes |

```python
e1 = expected_calibration_error(y, s_overconfident, n_bins=10)
e1_d = expected_calibration_error_debiased(y, s_overconfident, n_bins=10)
e2 = expected_calibration_error_l2(y, s_overconfident, n_bins=10)
e2_d = expected_calibration_error_l2_debiased(y, s_overconfident, n_bins=10)
print(f"L1: {e1:.4f} (plug-in)  {e1_d:.4f} (debiased)")
print(f"L2: {e2:.4f} (plug-in)  {e2_d:.4f} (debiased)")
```

**Which to use.**

- **L2-debiased** is the toolkit default for *reporting*: it preserves
  rank ordering across bin counts (Naeini et al.,
  [2015](https://arxiv.org/abs/1411.0760)), and the debiasing correction
  removes the small-sample inflation that Kumar et al.
  ([2019](https://arxiv.org/abs/1909.10155)) document. **Pitfall**: L1
  plug-in can swap rank when the bin count changes.
- **L1 plug-in** matches sklearn's calibration error and many published
  results. Use it for *comparison* with prior work, not for
  decision-making.
- **Equal-mass** (quantile) ECE is more robust to imbalance than
  equal-width:

```python
e_eqmass = expected_calibration_error_equal_mass(y, s_overconfident, n_bins=10)
print(f"L1 equal-mass: {e_eqmass:.4f}")
```

> **What NOT to do.** Don't compare ECE across two models computed with
> different bin counts.
> ECE is a *binned* estimator: small bin counts tend to understate
> miscalibration and large bin counts tend to overstate it, with the
> net bias depending on sample size. Pin `n_bins` per project and
> document it.

(brier)=
## Brier score & decomposition

The Brier score (mean squared probability error) is the
*proper-scoring-rule* analogue of ECE. It is threshold-free, sensitive
to calibration AND ranking, and decomposes additively into three
interpretable components (Murphy, 1973).

$$\text{BS} = \text{REL} - \text{RES} + \text{UNC}$$

- **REL** (reliability, lower better): bin-weighted mean squared
  distance between the mean predicted probability and the empirical
  positive rate in each bin.
- **RES** (resolution, higher better): bin-weighted variance of the
  per-bin positive rates around the marginal rate, i.e. how much the
  model *separates* outcomes.
- **UNC** (uncertainty, irreducible): the marginal Bernoulli variance
  $\bar y (1-\bar y)$.

```python
bs = brier_score(y, s_overconfident)
parts = brier_decomposition(y, s_overconfident, n_bins=10)
print(f"Brier: {bs:.4f}")
print(f"  reliability={parts['reliability']:.4f} "
      f"resolution={parts['resolution']:.4f} "
      f"uncertainty={parts['uncertainty']:.4f}")
# With binned scores the recombined value can differ slightly from `bs`,
# because scores vary within each bin.
print(f"  identity check: REL - RES + UNC = "
      f"{parts['reliability'] - parts['resolution'] + parts['uncertainty']:.4f}")
```

A model can have low Brier from *either* good calibration (low REL)
*or* strong separation (high RES). The decomposition makes this
trade-off visible: two models with the same Brier may have very
different operational profiles.

(recalibration)=
## Recalibration

When ECE / REL is high, the model can often be *recalibrated* post-hoc:
fit a 1-D monotone function on validation data that maps raw scores to
calibrated probabilities, leaving the rank order unchanged.

### Temperature scaling (Guo et al. 2017)

Single-parameter: divide logits by T before softmax. Preserves accuracy
exactly (argmax is invariant to dividing every logit by the same
positive T). Simplest and most common for transformer outputs.
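
As a rough illustration of what fitting T involves (not the toolkit's implementation), the sketch below picks the temperature that minimizes validation NLL over a grid, in plain NumPy. The helper names `softmax`, `nll`, and `fit_temperature_grid`, the grid range, and the synthetic `demo_*` logits are all assumptions made for this example.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with max-subtraction for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def nll(logits, labels, temperature):
    """Mean negative log-likelihood of integer labels after dividing logits by T."""
    probs = softmax(logits / temperature)
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(np.clip(picked, 1e-12, None))))


def fit_temperature_grid(logits, labels, grid=None):
    """Hypothetical helper: choose the T on a log-spaced grid that minimizes NLL."""
    if grid is None:
        grid = np.logspace(-1, 1, 201)  # assumed search range: 0.1 to 10
    losses = np.array([nll(logits, labels, t) for t in grid])
    return float(grid[int(np.argmin(losses))])


# Synthetic, deliberately overconfident binary logits (illustrative only).
demo_rng = np.random.default_rng(0)
demo_labels = demo_rng.binomial(1, 0.4, size=2000)
signal = (2 * demo_labels - 1) + demo_rng.normal(0, 1.5, size=2000)
demo_logits = np.column_stack([-3.0 * signal, 3.0 * signal])

t_star = fit_temperature_grid(demo_logits, demo_labels)
nll_pre = nll(demo_logits, demo_labels, 1.0)
nll_post = nll(demo_logits, demo_labels, t_star)
# Dividing by a positive T never changes which class has the largest logit.
same_argmax = np.array_equal(
    demo_logits.argmax(axis=1), (demo_logits / t_star).argmax(axis=1)
)
print(f"T*={t_star:.2f}  NLL: {nll_pre:.3f} -> {nll_post:.3f}  "
      f"argmax unchanged: {same_argmax}")
```

The grid search is used only because it is easy to read; a 1-D optimizer over log T would serve the same purpose. A fitted T above 1 softens overconfident probabilities, and the toolkit's `fit_temperature` may optimize differently.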
```python
from eval_toolkit import fit_temperature

# fit_temperature wants logits as shape (n, 2): col 0 = neg, col 1 = pos.
val_logits = np.column_stack([1 - linear, linear])
val_labels = y
result = fit_temperature(val_logits, val_labels)
print(f"T*={result['temperature']:.3f}  NLL: {result['nll_pre']:.3f} -> {result['nll_post']:.3f}")
```

### Isotonic regression

Non-parametric monotone fit. More flexible than temperature, but it
requires more validation data (sklearn rule of thumb: ≥ 1000 rows).
The fitted map is a step function, so it doesn't necessarily preserve
the smoothness of the score distribution.

```python
from eval_toolkit import fit_isotonic_calibrator

# Demo only: fits and evaluates on the same rows. In practice, fit on a
# validation slice and report on held-out data.
apply = fit_isotonic_calibrator(y, s_overconfident)
s_calibrated = apply(s_overconfident)
print(f"ECE before: {expected_calibration_error_l2_debiased(y, s_overconfident):.4f}")
print(f"ECE after:  {expected_calibration_error_l2_debiased(y, s_calibrated):.4f}")
```

### Platt scaling

Sigmoid fit (logistic regression with a single scaled+shifted feature).
Two parameters; more flexible than temperature, less than isotonic.

```python
from eval_toolkit import fit_platt_calibrator

# Demo only: same rows used for fitting and evaluation, as above.
apply = fit_platt_calibrator(y, s_overconfident)
s_calibrated = apply(s_overconfident)
print(f"ECE after Platt: {expected_calibration_error_l2_debiased(y, s_calibrated):.4f}")
```

(do-not-recalibrate)=
### When NOT to recalibrate

- **You have <500 validation rows.** Recalibrators overfit small
  samples; the post-calibration ECE on a fresh test set can be *worse*
  than the uncalibrated one.
- **The score distribution is bimodal at 0/1.** Many production
  classifiers output near-deterministic predictions. Recalibration
  can't add information; it just smooths the distribution. ECE will
  improve mechanically but the decision rule is unchanged.
- **You're comparing two raw models.** ECE comparison must be on the
  *raw* outputs unless you explicitly note "after recalibration on
  shared val set". Recalibrating one but not the other is unfair.
- **Production won't apply the calibrator.** If the deployment ships
  raw model outputs, your calibrated ECE is fiction.

(pytorch)=
## PyTorch & transformer specifics

### Logit-domain calibration

Temperature scaling is computed on **logits**, not probabilities. With
HuggingFace transformers, this is `model(...).logits` *before* applying
softmax. Calibrating after softmax loses information (the post-softmax
distribution has already saturated).

```python
# Sketch only: requires torch / transformers, marked skip for Sybil.
import torch  # noqa
# logits = model(input_ids).logits  # shape (batch, 2) for binary
# T = fit_temperature(logits.detach().cpu().numpy(), labels.cpu().numpy())["temperature"]
# probs = torch.softmax(logits / T, dim=-1)
```

### fp16 / bf16 numerics

Mixed-precision inference (fp16, bf16) introduces small numerical noise
in logits, which softmax amplifies in the tail. Empirically, ECE under
bf16 is ~0.001–0.005 higher than under fp32 on the same model,
depending on batch size. Two implications:

1. **Calibrate at inference precision.** If production uses bf16, fit
   temperature on bf16 logits, not on fp32 logits cast back.
2. **Don't compare ECE across precision levels.** An ECE delta of a few
   thousandths (the 0.001–0.005 range above) is within fp16/bf16 noise
   for moderate-size eval sets.

### Calibration drift across checkpoints

A transformer's calibration changes during fine-tuning even when its
accuracy plateaus. Fit temperature *on the same checkpoint* you'll
deploy; don't reuse a temperature from a previous epoch.
[reproducibility.md](reproducibility.md) discusses why the checkpoint
hash should land in the manifest's `code_versions`.

(calibration-pitfalls)=
## Pitfalls / Common mistakes

- **Reporting ECE on raw logits.** ECE is only meaningful when scores
  are in [0, 1] and interpretable as P(y=1 | x). The toolkit's
  `expected_calibration_error*` functions raise `ValueError` if scores
  fall outside [0, 1]; apply softmax / sigmoid first.
- **Picking `n_bins` arbitrarily.** Both equal-width and equal-mass ECE
  are sensitive to bin count. The toolkit defaults to 10; document and
  pin whatever you choose. Cross-paper comparisons require the same
  `n_bins`.
- **Comparing L1 and L2 ECE numerically.** They're on different scales
  (both are bounded by 1, but L2 values are typically smaller). Pick
  one and document the choice.
- **Recalibrating on the test set.** Use a *validation* slice carved
  off the train fold. Recalibrating on test directly leaks the metric
  you're about to report.
- **Ignoring single-class slices.** ECE is degenerate when y is
  all-positive or all-negative. The toolkit's `reliability_curve` flags
  these with a `"skipped"` marker.

## Putting it all together

```python
# Full calibration audit on a single slice.
bs = brier_score(y, s_overconfident)
ece = expected_calibration_error_l2_debiased(y, s_overconfident, n_bins=10)
parts = brier_decomposition(y, s_overconfident, n_bins=10)
curve = reliability_curve(y, s_overconfident, n_bins=10, strategy="quantile")

print(f"Brier:               {bs:.4f}")
print(f"  reliability={parts['reliability']:.4f}  resolution={parts['resolution']:.4f}")
print(f"L2-debiased ECE:     {ece:.4f}")
print(f"Equal-mass L1 ECE:   {curve['ece_equal_mass']:.4f}")
print(f"  (quantile bins; n_bins={curve['n_bins']})")
```

## Further reading

- Guo, C. et al. *On Calibration of Modern Neural Networks.* ICML 2017.
  [arXiv:1706.04599](https://arxiv.org/abs/1706.04599). Temperature
  scaling, the canonical post-hoc method.
- Naeini, M. P. et al. *Obtaining Well Calibrated Probabilities Using
  Bayesian Binning.* AAAI 2015.
  [arXiv:1411.0760](https://arxiv.org/abs/1411.0760). ECE definition
  and the equal-mass-binning rationale.
- Kumar, A., Liang, P., & Ma, T. *Verified Uncertainty Calibration.*
  NeurIPS 2019. [arXiv:1909.10155](https://arxiv.org/abs/1909.10155).
  Debiased ECE estimators.
- Murphy, A. H. *A new vector partition of the probability score.*
  J. Appl. Meteorology 12, 1973. The Brier decomposition.
- Nixon, J. et al. *Measuring Calibration in Deep Learning.* CVPRW 2019.
- sklearn:
  [`calibration_curve`](https://scikit-learn.org/stable/modules/generated/sklearn.calibration.calibration_curve.html),
  [`CalibratedClassifierCV`](https://scikit-learn.org/stable/modules/generated/sklearn.calibration.CalibratedClassifierCV.html).

See also: [thresholds.md](thresholds.md) (calibrate before
`CostSensitiveSelector`), [comparison.md](comparison.md) (paired
bootstrap on ECE differences).