Worked example: calibration with Platt + isotonic

What this shows. A scorer whose raw outputs rank well but are miscalibrated, the resulting ECE, and how Platt scaling and isotonic regression (each applied on its own) reduce that ECE. Both follow the two-step “fit on dev / apply on test” pattern.

Runtime: ~1 s. Pure-numpy core, no optional dependencies.

Setup

import numpy as np
from eval_toolkit import (
    fit_platt_calibrator,
    fit_isotonic_calibrator,
    expected_calibration_error,
    set_global_seeds,
)
set_global_seeds(42)

Synthetic miscalibrated scorer

Build a scenario where the scorer’s outputs are informative (high AUC) but miscalibrated: scores cluster around 0.7 (about 0.6 for negatives and 0.8 for positives) even though only about 30% of examples are positive. Real-world causes include SVMs that output decision-margin values and neural networks whose outputs have not yet been calibrated.

rng = np.random.default_rng(42)
n_dev, n_test = 500, 500

# Positive base rate is 0.3 (mostly negatives); scorer outputs are skewed high
y_dev = (rng.uniform(0, 1, n_dev) < 0.3).astype(int)
y_test = (rng.uniform(0, 1, n_test) < 0.3).astype(int)

def _miscalibrated(y, rng):
    # Discriminative (informative for ranking) but biased high
    return np.clip(0.7 + 0.2 * (y - 0.5) + rng.normal(0, 0.1, size=len(y)), 0.0, 1.0)

s_dev = _miscalibrated(y_dev, rng)
s_test = _miscalibrated(y_test, rng)
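As a quick sanity check on the claim that these scores rank well while sitting far above the base rate, the sketch below computes AUC from pairwise comparisons in plain numpy. The helper rank_auc and the seed are illustrative assumptions, not part of eval_toolkit, and the data is regenerated with the same formula so the block runs on its own.

```python
import numpy as np

def rank_auc(y_true, scores):
    """AUC as P(random positive outscores random negative); ties count half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Full pairwise comparison: simple, but O(n_pos * n_neg) memory
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Same generating formula as above, with an arbitrary illustrative seed
rng = np.random.default_rng(0)
y = (rng.uniform(0, 1, 500) < 0.3).astype(int)
s = np.clip(0.7 + 0.2 * (y - 0.5) + rng.normal(0, 0.1, size=len(y)), 0.0, 1.0)

auc = rank_auc(y, s)
print(f"AUC: {auc:.3f}  mean score: {s.mean():.3f}  base rate: {y.mean():.3f}")
```

A high AUC together with a mean score well above the base rate is the combination the rest of this example relies on: ranking is fine, the probabilities are not.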

Before calibration: high ECE

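expected_calibration_error bins predictions and measures |accuracy(bin) - confidence(bin)|, the gap between the observed positive rate and the mean predicted score in each bin, averaged across bins (in the standard definition, weighted by the fraction of samples in each bin). Miscalibrated scores produce high ECE even when ranking metrics are good: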

ece_uncal = expected_calibration_error(y_test, s_test, n_bins=10)
print(f"ECE (uncalibrated): {ece_uncal:.3f}")
assert ece_uncal > 0.2, "expected the synthetic scorer to be visibly miscalibrated"
ECE (uncalibrated): 0.363
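To make the binning concrete, here is a minimal numpy sketch of the standard equal-width, size-weighted ECE for binary labels. It is an assumption about what such a metric typically computes, not the toolkit's implementation; ece_sketch is a hypothetical name, and the toolkit's binning or weighting may differ.

```python
import numpy as np

def ece_sketch(y_true, probs, n_bins=10):
    """Size-weighted mean of |positive rate - mean score| over equal-width bins."""
    y_true = np.asarray(y_true, dtype=float)
    probs = np.asarray(probs, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Digitize against interior edges so bins are 0..n_bins-1 and 1.0 lands in the last bin
    bin_ids = np.digitize(probs, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # empty bins contribute nothing
        gap = abs(y_true[mask].mean() - probs[mask].mean())
        ece += mask.mean() * gap  # mask.mean() is the fraction of samples in this bin
    return ece

# Confident scores on all-negative labels: the gap equals the score itself
print(ece_sketch([0, 0, 0, 0], [0.9, 0.9, 0.9, 0.9]))
# Scores that match the observed positive rate in their bin: no gap
print(ece_sketch([0, 0, 1, 1], [0.5, 0.5, 0.5, 0.5]))
```

Because every bin is weighted by its occupancy, a scorer whose outputs all pile up in a few bins (as here) is judged almost entirely on those bins.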

Platt calibration (parametric, fast)

Platt scaling fits a sigmoid σ(a·s + b) to the labels by maximum likelihood, with the Laplace-smoothed targets used in Lin et al. (2007). With only two scalar parameters, it is fast and robust on small dev sets:

platt = fit_platt_calibrator(y_dev, s_dev)
s_test_platt = platt(s_test)
ece_platt = expected_calibration_error(y_test, s_test_platt, n_bins=10)
print(f"ECE (Platt):        {ece_platt:.3f}  (a={platt.a:.3f}, b={platt.b:.3f})")
assert ece_platt < ece_uncal, "Platt calibration should reduce ECE"
ECE (Platt):        0.061  (a=17.319, b=-12.925)
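For readers who want to see what a fit of this kind involves, the following is a minimal sketch of Platt scaling in numpy: smoothed targets (n_pos + 1)/(n_pos + 2) and 1/(n_neg + 2), then Newton steps with step halving on the cross-entropy. The name fit_platt_sketch, the optimizer details, and the seed are assumptions for illustration; the toolkit's fit_platt_calibrator may be implemented differently.

```python
import numpy as np

def _sigmoid(z):
    # tanh form avoids overflow for large |z|
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def _nll(w, X, t):
    # Cross-entropy against soft targets t, written stably via logaddexp
    z = X @ w
    return np.sum(np.logaddexp(0.0, z) - t * z)

def fit_platt_sketch(y, s, n_iter=100, tol=1e-10):
    """Return (a, b) for sigma(a*s + b). Hypothetical helper, not the toolkit's."""
    y = np.asarray(y, dtype=float)
    s = np.asarray(s, dtype=float)
    n_pos, n_neg = y.sum(), (1.0 - y).sum()
    # Laplace-smoothed targets keep the optimum finite even on separable data
    t = np.where(y == 1, (n_pos + 1.0) / (n_pos + 2.0), 1.0 / (n_neg + 2.0))
    X = np.column_stack([s, np.ones_like(s)])
    w = np.array([0.0, np.log((n_pos + 1.0) / (n_neg + 1.0))])  # start at the prior
    for _ in range(n_iter):
        p = _sigmoid(X @ w)
        grad = X.T @ (p - t)
        hess = (X * (p * (1.0 - p))[:, None]).T @ X + 1e-12 * np.eye(2)
        step = np.linalg.solve(hess, grad)
        # Halve the Newton step until the loss does not increase
        scale, base = 1.0, _nll(w, X, t)
        while _nll(w - scale * step, X, t) > base and scale > 1e-8:
            scale *= 0.5
        w = w - scale * step
        if np.max(np.abs(scale * step)) < tol:
            break
    return float(w[0]), float(w[1])

# Illustrative data with the same shape as the example above (arbitrary seed)
rng = np.random.default_rng(0)
y_demo = (rng.uniform(0, 1, 500) < 0.3).astype(int)
s_demo = np.clip(0.7 + 0.2 * (y_demo - 0.5) + rng.normal(0, 0.1, size=500), 0.0, 1.0)

a, b = fit_platt_sketch(y_demo, s_demo)
p_demo = _sigmoid(a * s_demo + b)
print(f"a={a:.3f}, b={b:.3f}, mean calibrated={p_demo.mean():.3f}, base rate={y_demo.mean():.3f}")
```

At the optimum the intercept forces the mean calibrated probability to match the mean smoothed target, which is why the calibrated mean lands close to the base rate.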

Isotonic regression (non-parametric, flexible)

Isotonic regression fits a monotone (non-decreasing) step function via PAVA, the pool-adjacent-violators algorithm. It is more flexible than Platt scaling but needs more dev data to avoid overfitting:

isotonic = fit_isotonic_calibrator(y_dev, s_dev)
s_test_iso = isotonic(s_test)
ece_iso = expected_calibration_error(y_test, s_test_iso, n_bins=10)
print(f"ECE (isotonic):     {ece_iso:.3f}")
assert ece_iso < ece_uncal, "isotonic calibration should reduce ECE"
ECE (isotonic):     0.046
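The pooling idea behind PAVA can be sketched in a few lines: sort by score, then merge adjacent blocks whenever an earlier block's mean label is not below the next one's. This is a minimal illustration under assumed names (fit_isotonic_sketch, calibrate); it predicts with a plain step function and is not the toolkit's fit_isotonic_calibrator.

```python
import numpy as np

def fit_isotonic_sketch(y, s):
    """Fit a non-decreasing step function from scores to positive rates via PAVA."""
    y = np.asarray(y, dtype=float).ravel()
    s = np.asarray(s, dtype=float).ravel()
    # Collapse tied scores first so each unique score is a single weighted point
    uniq, inverse = np.unique(s, return_inverse=True)
    sums = np.bincount(inverse, weights=y)
    counts = np.bincount(inverse).astype(float)

    blocks = []  # each block: [label sum, sample count, right-most score]
    for total, count, score in zip(sums, counts, uniq):
        blocks.append([total, count, score])
        # Pool adjacent violators: merge while the previous mean is not below the current mean
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] >= blocks[-1][0] / blocks[-1][1]:
            t, c, right = blocks.pop()
            blocks[-1][0] += t
            blocks[-1][1] += c
            blocks[-1][2] = right

    values = np.array([blk[0] / blk[1] for blk in blocks])
    right_edges = np.array([blk[2] for blk in blocks])

    def calibrate(scores):
        # Each block covers (previous right edge, its own right edge]; clamp outside the range
        idx = np.searchsorted(right_edges, np.asarray(scores, dtype=float), side="left")
        return values[np.clip(idx, 0, len(values) - 1)]

    return calibrate

# Tiny example: the middle two points violate monotonicity and get pooled
calibrate = fit_isotonic_sketch([0, 1, 0, 1], [0.1, 0.2, 0.3, 0.4])
print(calibrate([0.1, 0.2, 0.3, 0.4]))
```

Each pooled block becomes one flat step, which is where the flexibility (and the appetite for dev data) comes from: the number of steps is chosen by the data rather than fixed in advance.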

Picking between Platt and isotonic

A practical rule of thumb (Niculescu-Mizil & Caruana 2005):

  • Platt when the dev set is small (<1000) or the miscalibration is approximately sigmoid-shaped (common for SVMs, logistic regression with mild shift, NNs with temperature drift).

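  • Isotonic when the dev set is large (>1000) and the miscalibration is monotone but not sigmoid-shaped (e.g., random forests with bimodal output). Isotonic regression is itself monotone, so it cannot repair a non-monotone relationship between score and probability.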

Both return a callable fit object, so they are interchangeable in calibration pipelines. The toolkit also ships fit_beta_calibrator (Kull, Silva Filho & Flach 2017) for the parametric beta family when a sigmoid is too restrictive.

See also