---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
kernelspec:
  display_name: Python 3
  language: python
  name: python3
---

# Worked example: calibration with Platt + isotonic

> **What this shows.** A miscalibrated scorer (uncalibrated, logit-like
> values), its ECE, and then Platt and isotonic recalibration, each fitted
> separately and each reducing ECE. Both follow the two-step
> "fit on dev / apply on test" pattern.
>
> **Runtime:** ~1 s. Pure-numpy core, no optional dependencies.

## Setup

```{code-cell}
import numpy as np
from eval_toolkit import (
    fit_platt_calibrator,
    fit_isotonic_calibrator,
    expected_calibration_error,
    set_global_seeds,
)
set_global_seeds(42)
```

## Synthetic miscalibrated scorer

Build a scenario where the scorer's outputs are *informative* (high AUC)
but *miscalibrated*: every score is shifted toward 0.7, so even negatives
score around 0.6 while the true positive rate is only 0.3. Real-world
causes include SVMs that emit decision-margin values and neural networks
whose raw outputs have not been calibrated.

```{code-cell}
rng = np.random.default_rng(42)
n_dev, n_test = 500, 500

# Base rate is 0.3 (mostly negatives); scorer outputs are skewed high
y_dev = (rng.uniform(0, 1, n_dev) < 0.3).astype(int)
y_test = (rng.uniform(0, 1, n_test) < 0.3).astype(int)

def _miscalibrated(y, rng):
    # Negatives center on 0.6, positives on 0.8: informative for ranking, but biased high
    return np.clip(0.7 + 0.2 * (y - 0.5) + rng.normal(0, 0.1, size=len(y)), 0.0, 1.0)

s_dev = _miscalibrated(y_dev, rng)
s_test = _miscalibrated(y_test, rng)
```
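To check the "informative but miscalibrated" claim independently of the toolkit, the sketch below computes AUC as the fraction of correctly ordered (positive, negative) pairs and compares the mean score with the base rate. The helper `pairwise_auc` and the `*_demo` variables are hypothetical names introduced for illustration; the data is regenerated with its own seed so the block runs on its own.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """Hypothetical helper: AUC as the share of (positive, negative)
    pairs ranked correctly, counting ties as half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]  # all pairwise score differences
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

# Same generating recipe as the cell above, with an independent seed.
rng_demo = np.random.default_rng(0)
y_demo = (rng_demo.uniform(0, 1, 500) < 0.3).astype(int)
s_demo = np.clip(0.7 + 0.2 * (y_demo - 0.5) + rng_demo.normal(0, 0.1, size=500), 0.0, 1.0)

auc_demo = pairwise_auc(y_demo, s_demo)
bias_demo = float(s_demo.mean() - y_demo.mean())
print(f"AUC: {auc_demo:.3f}, mean score minus base rate: {bias_demo:.3f}")
```

A high AUC together with a large positive bias is the combination that recalibration targets: the ranking is already good, so only the mapping from score to probability needs fixing.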

## Before calibration: high ECE

`expected_calibration_error` bins predictions and measures
``|accuracy(bin) - confidence(bin)|``, averaged across bins (in the
standard definition, weighted by the share of samples in each bin).
Miscalibrated scores produce a high ECE even when ranking metrics are good:

```{code-cell}
ece_uncal = expected_calibration_error(y_test, s_test, n_bins=10)
print(f"ECE (uncalibrated): {ece_uncal:.3f}")
assert ece_uncal > 0.2, "expected the synthetic scorer to be visibly miscalibrated"
```
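For intuition, the binned ECE described above can be sketched in a few lines of NumPy. This is a minimal version under stated assumptions: equal-width bins on [0, 1], the binary form (fraction of positives versus mean predicted probability per bin), and weighting by bin size. The name `binned_ece` is hypothetical, and the toolkit's own binning or weighting may differ.

```python
import numpy as np

def binned_ece(y_true, probs, n_bins=10):
    """Hypothetical sketch of equal-width binned ECE for binary labels."""
    y_true = np.asarray(y_true, dtype=float)
    probs = np.asarray(probs, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Using only the interior edges maps every value in [0, 1] to a bin
    # index in 0..n_bins-1, with 1.0 landing in the last bin.
    bin_idx = np.digitize(probs, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if not mask.any():
            continue  # empty bins contribute nothing
        gap = abs(y_true[mask].mean() - probs[mask].mean())
        ece += mask.mean() * gap  # weight by the share of samples in the bin
    return float(ece)

# A scorer that always says 0.8 when every label is 0 is off by 0.8.
ece_toy = binned_ece([0, 0, 0], [0.8, 0.8, 0.8])
print(f"toy ECE: {ece_toy:.3f}")
```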

## Platt calibration (parametric, fast)

Platt scaling fits a sigmoid ``σ(a·s + b)`` to the labels by maximum
likelihood, using the Laplace-smoothed targets described in Lin et al.
(2007). With only two scalar parameters, the fit is fast and robust on
small dev sets:

```{code-cell}
platt = fit_platt_calibrator(y_dev, s_dev)
s_test_platt = platt(s_test)
ece_platt = expected_calibration_error(y_test, s_test_platt, n_bins=10)
print(f"ECE (Platt):        {ece_platt:.3f}  (a={platt.a:.3f}, b={platt.b:.3f})")
assert ece_platt < ece_uncal, "Platt calibration should reduce ECE"
```
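The fit described above can be sketched without the toolkit. The block below is an illustrative version, not the toolkit's implementation: it minimizes cross-entropy against smoothed targets with Newton steps and simple step halving. The name `fit_platt_sketch`, the iteration settings, and the demo data are all assumptions made for this example.

```python
import numpy as np

def fit_platt_sketch(y, s, n_iter=100, tol=1e-10):
    """Hypothetical helper: fit sigmoid(a*s + b) by Newton's method."""
    y = np.asarray(y, dtype=float)
    s = np.asarray(s, dtype=float)
    n_pos = y.sum()
    n_neg = len(y) - n_pos
    # Smoothed targets: slightly below 1 for positives, slightly above 0 for negatives.
    t = np.where(y == 1, (n_pos + 1.0) / (n_pos + 2.0), 1.0 / (n_neg + 2.0))
    X = np.column_stack([s, np.ones_like(s)])  # columns: score, intercept

    def loss(w):
        z = X @ w
        # Cross-entropy with soft targets, written as log(1 + e^z) - t*z.
        return np.sum(np.logaddexp(0.0, z) - t * z)

    w = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        grad = X.T @ (p - t)
        hess = (X * (p * (1.0 - p))[:, None]).T @ X + 1e-12 * np.eye(2)
        step = np.linalg.solve(hess, grad)
        lr = 1.0
        while loss(w - lr * step) > loss(w) and lr > 1e-8:
            lr *= 0.5  # halve the step until the loss stops increasing
        w = w - lr * step
        if np.max(np.abs(lr * step)) < tol:
            break
    return float(w[0]), float(w[1])

# Demo data: same recipe as the synthetic scorer, independent seed.
rng_demo = np.random.default_rng(0)
y_demo = (rng_demo.uniform(0, 1, 500) < 0.3).astype(int)
s_demo = np.clip(0.7 + 0.2 * (y_demo - 0.5) + rng_demo.normal(0, 0.1, size=500), 0.0, 1.0)

a_hat, b_hat = fit_platt_sketch(y_demo, s_demo)
p_demo = 1.0 / (1.0 + np.exp(-(a_hat * s_demo + b_hat)))
print(f"a={a_hat:.2f}, b={b_hat:.2f}, mean calibrated p={p_demo.mean():.3f}")
```

Because the intercept `b` is fitted freely, the mean calibrated probability on the dev data ends up close to the dev base rate, which is what removes the "biased high" offset.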

## Isotonic regression (non-parametric, flexible)

Isotonic regression fits a monotone (non-decreasing) step function via
PAVA (the pool-adjacent-violators algorithm). It is more flexible than
Platt scaling but needs more dev data to avoid overfitting:

```{code-cell}
isotonic = fit_isotonic_calibrator(y_dev, s_dev)
s_test_iso = isotonic(s_test)
ece_iso = expected_calibration_error(y_test, s_test_iso, n_bins=10)
print(f"ECE (isotonic):     {ece_iso:.3f}")
assert ece_iso < ece_uncal, "isotonic calibration should reduce ECE"
```
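To make PAVA concrete, here is a minimal sketch with unit weights: labels are sorted by score, and adjacent blocks are pooled whenever their means would violate monotonicity. The names `pava` and `fit_isotonic_sketch` are hypothetical, and details such as tie handling and behaviour outside the fitted range are simplifying assumptions that may differ from the toolkit.

```python
import numpy as np

def pava(y_sorted):
    """Hypothetical helper: least-squares non-decreasing fit (unit weights)."""
    sums, counts = [], []  # one (sum, count) entry per pooled block
    for v in y_sorted:
        sums.append(float(v))
        counts.append(1)
        # Pool the last two blocks while their means are out of order.
        while len(sums) > 1 and sums[-2] / counts[-2] > sums[-1] / counts[-1]:
            s_last, c_last = sums.pop(), counts.pop()
            sums[-1] += s_last
            counts[-1] += c_last
    return np.concatenate([np.full(c, s / c) for s, c in zip(sums, counts)])

def fit_isotonic_sketch(y, s):
    """Fit on (label, score) pairs; return a step-function predictor."""
    s = np.asarray(s, dtype=float)
    order = np.argsort(s, kind="stable")
    s_sorted = s[order]
    fitted = pava(np.asarray(y, dtype=float)[order])

    def predict(scores):
        # Step function: use the fitted value of the nearest dev score at or
        # below each input, clamping inputs below the smallest dev score.
        idx = np.searchsorted(s_sorted, np.asarray(scores, dtype=float), side="right") - 1
        return fitted[np.clip(idx, 0, len(fitted) - 1)]

    return predict

pooled = pava([1, 0, 1, 0])  # every adjacent pair violates, so all values pool
iso_toy = fit_isotonic_sketch([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
print(pooled, iso_toy([0.0, 0.5, 1.0]))
```

The pooling step is also where the overfitting risk comes from: with few dev examples, blocks can be tiny, and each block mean is a noisy estimate.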

## Picking between Platt and isotonic

A practical rule of thumb (Niculescu-Mizil & Caruana 2005):

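- **Platt** when the dev set is small (fewer than about 1000 examples) or
  the miscalibration is approximately sigmoid-shaped (common for SVMs,
  log-reg with mild shift, NNs with temperature drift).
- **Isotonic** when the dev set is large (about 1000 examples or more) and
  the miscalibration is monotone but not sigmoid-shaped (e.g., random
  forests with bimodal output). Isotonic regression fits only monotone
  maps, so it cannot repair a non-monotone distortion.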

Both return a callable fit object, so they are interchangeable in
calibration pipelines. The toolkit also ships
`fit_beta_calibrator` (Kull, Silva Filho & Flach 2017) for the parametric
beta family when a sigmoid is too restrictive.
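Because a calibrator is just a callable from scores to probabilities, comparison code does not need to know which method produced it. The sketch below illustrates the pattern with two stand-in callables rather than the toolkit's fit objects; `brier_score`, `evaluate_calibrators`, and both stand-ins are hypothetical names introduced only for this example.

```python
import numpy as np

def brier_score(y_true, probs):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    y_true = np.asarray(y_true, dtype=float)
    probs = np.asarray(probs, dtype=float)
    return float(np.mean((probs - y_true) ** 2))

def evaluate_calibrators(calibrators, y_true, scores):
    """Apply each callable to the raw scores and report its Brier score."""
    return {name: brier_score(y_true, cal(scores)) for name, cal in calibrators.items()}

# Stand-in calibrators (hypothetical): anything callable on a score array fits here.
def identity(scores):
    return np.asarray(scores, dtype=float)

def constant_prior(scores):
    return np.full(len(scores), 0.3)

y_toy = np.array([0, 0, 0, 1, 0, 1, 0, 0, 1, 0])
s_toy = np.array([0.6, 0.55, 0.65, 0.8, 0.6, 0.85, 0.5, 0.7, 0.75, 0.6])

results = evaluate_calibrators(
    {"identity": identity, "constant_prior": constant_prior}, y_toy, s_toy
)
for name, score in results.items():
    print(f"{name:>15}: Brier = {score:.3f}")
```

In a real pipeline the dictionary values would be the fitted Platt, isotonic, or beta objects, fitted on dev and evaluated on test as in the cells above.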

## See also

- [`calibration.py` reference](../api/calibration.md) — full list:
  Platt, isotonic, beta, temperature scaling.
- [`metrics.py` reference](../api/metrics.md) — ECE variants
  (`expected_calibration_error_debiased`,
  `expected_calibration_error_l2`, etc.) and `brier_decomposition`.
- [Metrics + bootstrap example](metrics_and_bootstrap.md) — wrap any of
  these calibrated outputs in a CI.