Calibration battery composition — raw plus temperature plus isotonic interventions
Superseded on one or more axes by ADR-056. The body below retains its original prose per the ADR-073 immutability rule; the corrected position lives in the superseding ADR. See the Decisions index to navigate.
ADR-023: Calibration battery composition — raw plus temperature plus isotonic interventions
Status
Accepted (2026-05-15). New lock at the §3 Eval ledger row 343 level; not pre-locked by ADR-006 (which named the calibration-battery row but deferred composition).
Context
A classifier is calibrated if its predicted probabilities match observed frequencies: among rows where the model says “70 percent positive”, roughly 70 percent are positive (Guo et al. 2017). Modern transformer fine-tuning routinely produces miscalibrated outputs; sklearn LogisticRegression with class_weight='balanced' produces a different miscalibration profile (typically under-confident on the positive class); LLM-judges at temperature=0 produce near-categorical softmax outputs with high concentration at the extremes; ProtectAI v1/v2 produce softmax outputs with unknown calibration character.
SPEC §3 Eval pre-named the calibration battery as ECE plus Brier plus reliability curve but did not specify (a) which ECE variant for headline; (b) bin count; (c) whether interventions are applied; (d) per-(rung, fold, seed) vs aggregate fitting.
Phase 0-04 walk Q5 surfaced that eval-toolkit ships a complete calibration battery — all 4 ECE variants (L1/L2 × plug-in/debiased), Brier plus Brier decomposition, reliability curves, and four calibrators (temperature, Platt, isotonic, beta). The choice question is therefore about reporting depth, not primitive availability. Brandon’s follow-up question — “are we also considering isotonic regression to calibrate the models?” — surfaced the lift-delta-chain framing: the before-after-temperature delta tells you “how much miscalibration is monotone-rescaling fixable” (Guo 2017 1-parameter fix); the before-after-isotonic delta tells you “how much miscalibration is arbitrary-monotone-remapping fixable” (high-capacity non-parametric fix). The temperature-to-isotonic gap is the methodology-informative quantity.
ADR-011 Guarantee 6 explicitly anticipates calibration interventions (“calibration-fit (temperature/isotonic) done on validation only”) so the intervention is spec-aligned.
Decision
Headline composition (PDF executive table)
| Column | Primitive | Notes |
|---|---|---|
| ECE-equal-mass | eval_toolkit.expected_calibration_error_equal_mass(n_bins=15, strategy='quantile') | L1 plug-in equal-mass binning; n_bins=15 follows Guo 2017 (whose bins are equal-width); equal-mass binning preferred under class imbalance per eval-toolkit methodology/calibration.md guidance |
| Brier | eval_toolkit.brier_score | Proper scoring rule; no binning required |
Both computed on raw model outputs (no intervention applied). Per ADR-022 per-row-metric aggregation rule: pool rows across (fold, seed) per rung; compute once per rung; bootstrap CI per ADR-024 cross-fold methodology.
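The two headline columns can be illustrated with a minimal standard-library sketch. It is not the eval-toolkit implementation: the function names, the synthetic data, and the choice to bin the positive-class probability of a binary classifier are assumptions made for illustration only.

```python
import random

def brier_score(labels, probs):
    """Mean squared error between positive-class probability and the 0/1 label."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)

def ece_equal_mass(labels, probs, n_bins=15):
    """L1 plug-in ECE over equal-mass (quantile) bins of the positive-class probability."""
    order = sorted(range(len(probs)), key=lambda i: probs[i])
    n = len(order)
    ece = 0.0
    for b in range(n_bins):
        idx = order[b * n // n_bins:(b + 1) * n // n_bins]
        if not idx:  # fewer rows than bins leaves some bins empty
            continue
        mean_prob = sum(probs[i] for i in idx) / len(idx)
        frac_pos = sum(labels[i] for i in idx) / len(idx)
        ece += len(idx) / n * abs(frac_pos - mean_prob)
    return ece

# Synthetic rows (assumption): q is the true positive rate, labels are drawn from it.
rng = random.Random(0)
true_probs = [rng.random() for _ in range(5000)]
labels = [1 if rng.random() < q else 0 for q in true_probs]

# A hypothetical over-confident model doubles the log-odds, pushing scores toward 0 and 1.
overconfident = [q * q / (q * q + (1 - q) * (1 - q)) for q in true_probs]

ece_calibrated = ece_equal_mass(labels, true_probs)
ece_overconfident = ece_equal_mass(labels, overconfident)
brier_calibrated = brier_score(labels, true_probs)
brier_overconfident = brier_score(labels, overconfident)
print(f"calibrated:     ECE={ece_calibrated:.3f}  Brier={brier_calibrated:.3f}")
print(f"over-confident: ECE={ece_overconfident:.3f}  Brier={brier_overconfident:.3f}")
```

Both metrics should come out higher for the over-confident scores. Only the ECE depends on a binning choice, which is why the headline pairs it with Brier.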
Spoke composition (WRITEUP/calibration.md)
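Full 4-ECE matrix per rung: plug-in vs debiased (L1 and L2). The plug-in-vs-debiased delta surfaces how much finite-sample estimator bias affects the headline ECE; Kumar et al. 2019 show that the plug-in estimator is biased at finite sample sizes and propose the debiased estimator as the correction (their separate underestimation result concerns binning a continuous-output model, not the plug-in estimator itself).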
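Brier decomposition per rung: reliability, refinement, and uncertainty components via eval_toolkit.brier_decomposition. Under the Murphy decomposition, Brier = reliability − resolution + uncertainty, and refinement = uncertainty − resolution, so the three reported components are not additive: Brier = reliability + refinement. The reliability component is the “miscalibration” content of Brier; refinement is the “discrimination” content; the decomposition surfaces which component dominates per rung.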
Reliability diagrams per rung: equal-mass quantile binning (eval_toolkit.reliability_curve(strategy='quantile')); the canonical visual counterpart to the ECE numbers.
Intervention deltas per rung — three calibration states reported in a small table:
| State | Calibrator | ECE-equal-mass | Brier |
|---|---|---|---|
| Raw | (none) | value | value |
| Temperature-scaled | eval_toolkit.fit_temperature on val per-(rung, fold, seed) | value | value |
| Isotonic-regressed | eval_toolkit.fit_isotonic_calibrator on val per-(rung, fold, seed) | value | value |
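The Temperature-scaled row of the table can be sketched end to end as follows. This is a hedged illustration, not eval_toolkit.fit_temperature: `fit_temperature_sketch`, `make_split`, the golden-section optimiser, and the synthetic over-confident logits are all assumptions. It covers the binary case, where the calibrated probability is sigmoid(logit / T), and fits T on the validation split only.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def mean_nll(logits, labels, inv_temp):
    """Mean binary log loss of sigmoid(inv_temp * logit)."""
    total = 0.0
    for z, y in zip(logits, labels):
        s = inv_temp * z if y == 1 else -inv_temp * z
        total += max(-s, 0.0) + math.log1p(math.exp(-abs(s)))  # softplus(-s)
    return total / len(labels)

def fit_temperature_sketch(val_logits, val_labels, lo=0.01, hi=10.0, iters=80):
    """Golden-section search over 1/T; the log loss is convex in 1/T."""
    ratio = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    for _ in range(iters):
        c, d = b - ratio * (b - a), a + ratio * (b - a)
        if mean_nll(val_logits, val_labels, c) < mean_nll(val_logits, val_labels, d):
            b = d
        else:
            a = c
    return 2.0 / (a + b)  # T = 1 / midpoint of the final bracket

def make_split(rng, n, miscal=3.0):
    """Synthetic split (assumption): logits are `miscal` times the true log-odds."""
    true_log_odds = [rng.gauss(0.0, 1.5) for _ in range(n)]
    labels = [1 if rng.random() < sigmoid(t) else 0 for t in true_log_odds]
    return [miscal * t for t in true_log_odds], labels

def brier(labels, probs):
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)

rng = random.Random(0)
val_logits, val_labels = make_split(rng, 2000)
test_logits, test_labels = make_split(rng, 2000)

# Fit on validation only, then apply the frozen temperature to test rows.
temperature = fit_temperature_sketch(val_logits, val_labels)
brier_raw = brier(test_labels, [sigmoid(z) for z in test_logits])
brier_scaled = brier(test_labels, [sigmoid(z / temperature) for z in test_logits])
print(f"T={temperature:.2f}  Brier raw={brier_raw:.4f}  scaled={brier_scaled:.4f}")
```

In the real battery this fit would repeat once per (rung, fold, seed), and the Raw and Temperature-scaled rows would report the same metric before and after applying the frozen calibrator.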
Interventions — fitting protocol
| Sub-decision | Locked policy |
|---|---|
| Calibrators applied | Temperature (Guo 2017 1-parameter logit-scaling) + Isotonic (sklearn IsotonicRegression-wrapped, non-parametric monotonic remapping) |
| Calibrators NOT applied | Platt (deferred to afterword — legacy SVM-margin convention; minor lift over temperature on transformer outputs); beta (deferred to afterword — boundary-niche; useful when scores cluster at 0/1) |
| Fitting data | Validation split per ADR-016’s per-LODO-fold val (per ADR-011 Guarantee 6 — no test-set leakage) |
| Per-rung fitting granularity | Per-(rung, fold, seed) yielding 12 trained calibrators per rung × 2 interventions; per-(rung, fold) yielding 4 reference calibrators per reference rung × 2 interventions |
| Test-set application | Calibrated scores computed by applying the val-fit calibrator to test rows; ECE plus Brier re-computed |
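| Rank-based metric preservation | Each calibrator is a monotonic map applied within one (rung, fold, seed), so within that unit PR-AUC, ROC-AUC, and recall@FPR are unchanged under temperature scaling (strictly monotonic for T > 0); isotonic regression is monotone non-decreasing and can introduce ties, so its rank-based metrics can shift slightly; this methodology subtlety noted in the spoke |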
| Per-slice calibration after intervention | Deferred unless reviewer asks (per-slice n too small for stable temperature fits) |
| Maximum-calibration-error (worst-bin) | Computed via eval_toolkit.maximum_calibration_error; dumped to evals/calibration/per_obs_audit.parquet; not headline or spoke (audit-only) |
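The rank-preservation row can be checked numerically with a small sketch. It uses a hand-written pool-adjacent-violators fit and a pairwise ROC-AUC rather than eval-toolkit or sklearn, the data are synthetic, and for brevity the isotonic map is fitted and applied on the same rows (the protocol above fits on validation and applies to test). All names here are hypothetical.

```python
import math
import random

def roc_auc(labels, scores):
    """Mann-Whitney ROC-AUC; tied positive/negative pairs count one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def pav_fit_transform(labels, scores):
    """Pool-adjacent-violators isotonic fit, returned in the input row order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    blocks = []  # [label_sum, row_count] per pooled block, in score order
    for i in order:
        blocks.append([float(labels[i]), 1])
        # Merge while the previous block's mean exceeds the current block's mean.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted_sorted = [s / c for s, c in blocks for _ in range(c)]
    fitted = [0.0] * len(scores)
    for rank, i in enumerate(order):
        fitted[i] = fitted_sorted[rank]
    return fitted

# Synthetic rows (assumption): logits are 3x the true log-odds.
rng = random.Random(1)
true_log_odds = [rng.gauss(0.0, 1.5) for _ in range(300)]
labels = [1 if rng.random() < 1.0 / (1.0 + math.exp(-t)) else 0 for t in true_log_odds]
logits = [3.0 * t for t in true_log_odds]

# Temperature scaling with any T > 0 is strictly increasing, so ranks are kept.
temp_probs = [1.0 / (1.0 + math.exp(-z / 2.5)) for z in logits]
# Isotonic output is a step function: non-decreasing, but with ties inside each block.
iso_probs = pav_fit_transform(labels, logits)

auc_raw = roc_auc(labels, logits)
auc_temp = roc_auc(labels, temp_probs)
auc_iso = roc_auc(labels, iso_probs)
print(f"AUC raw={auc_raw:.4f} temperature={auc_temp:.4f} isotonic={auc_iso:.4f}")
print(f"distinct scores: raw={len(set(logits))} isotonic={len(set(iso_probs))}")
```

The temperature-scaled AUC matches the raw AUC, while the isotonic output collapses many rows onto shared values, so tie handling decides whether its rank-based metrics move.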
Phase 1 deliverables
- src/eval/calibration_battery.py — orchestrates raw-plus-temperature-plus-isotonic computation per rung
- evals/calibration/ __fold __seed __intervention<temperature|isotonic|raw>.json: per-(rung, fold, seed, intervention) ECE plus Brier values
- evals/calibration/per_obs_audit.parquet: per-(rung, fold, seed) audit rows including maximum-calibration-error
- WRITEUP/calibration.md spoke filename pre-committed
- WRITEUP/methodology.md gains a one-paragraph subsection “Calibration interventions are monotonic and therefore don’t change rank-based metrics”
Consequences
Positive
- Headline carries Guo 2017-standard ECE (PromptShield-comparable for cross-paper comparison) plus Brier (proper-scoring-rule baseline that doesn’t require binning) — 2 columns per rung
- Spoke surfaces methodology depth — 4 ECE variants reveals plug-in-vs-debiased bias; Brier decomposition reveals refinement-vs-reliability split; intervention deltas reveal temperature-vs-isotonic gap
- Library-first — all primitives from eval-toolkit; no hand-rolling
- Compute essentially free — calibrator fits are CPU-seconds per (rung, fold, seed)
- Aligns with ADR-011 Guarantee 6 (validation-only fitting) — calibrator-fit-on-val-applied-to-test is exactly the locked pattern
- Aligns with ADR-005 Principle 2 (honest evaluation preferred) — surfaces miscalibration honestly rather than reporting only post-intervention numbers
Negative
- WRITEUP/calibration.md is a new spoke; one more Phase 5 deliverable
- Per-(rung, fold, seed) calibrator artifacts (96 trained calibrators + 32 reference = 128 JSON files); modest persistence overhead
- Methodology subtlety about “monotonic intervention preserves ranks” needs a clear paragraph for A2 reviewer (otherwise reviewer may wonder why PR-AUC isn’t reported after intervention)
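The artifact count in the list above can be reproduced with a short arithmetic check. The fold, seed, and rung counts are not stated in this ADR; the values below are assumptions inferred from the stated per-rung figures (12 and 4 calibrators per intervention) and the stated totals (96 and 32).

```python
# Assumed breakdown, inferred from the ADR's own per-rung and total figures.
n_folds = 4             # per-(rung, fold) fitting yields 4 reference calibrators per rung
n_seeds = 3             # assumption: 12 per-(fold, seed) calibrators / 4 folds
n_interventions = 2     # temperature + isotonic
n_trained_rungs = 4     # assumption: 96 / (12 * 2)
n_reference_rungs = 4   # assumption: 32 / (4 * 2)

trained = n_trained_rungs * n_folds * n_seeds * n_interventions
reference = n_reference_rungs * n_folds * n_interventions
total = trained + reference
print(trained, reference, total)
```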
Neutral
- Maximum-calibration-error computed alongside but not surfaced — reviewer can still see worst-bin behavior in audit parquet if curious; aligns with audit-trail discipline
Alternatives considered
- Option A (Raw only, no interventions): simplest; rejected because doesn’t surface how much miscalibration is correctable by simple post-hoc fix — methodology contribution lost
- Option B (Raw plus temperature only): Guo 2017 standard; rejected because no high-capacity comparison point; temperature-vs-isotonic gap is the methodology-informative quantity
- Option D (Full 4-calibrator battery — temperature, Platt, isotonic, beta): most rigorous; rejected because Platt and beta calibrators show similar behavior to temperature and isotonic respectively in preliminary characterization; deferred to afterword if reviewer asks
- Per-rung pooled (rather than per-fold-seed) calibrator fitting: simpler; rejected because breaks the per-(fold, seed)-paired-across-rungs structure locked by ADR-022; would also leak val rows from one fold into another fold’s calibrator
- Per-slice calibration after intervention: most granular; rejected because per-slice n less than or equal to 1054 is too small for stable temperature fits; deferred unless reviewer asks
References
See frontmatter references list. Primary anchors: Kumar, Liang and Ma 2019 (debiased ECE estimators, arXiv:1909.10155); Guo et al. 2017 (calibration of modern neural networks plus temperature scaling, arXiv:1706.04599); Naeini, Cooper and Hauskrecht 2015 (Bayesian binning and ECE history, AAAI); Niculescu-Mizil and Caruana 2005 (calibrator comparison, Platt vs isotonic); Kull, Silva Filho and Flach 2017 (beta calibration, AISTATS, PMLR); eval-toolkit methodology/calibration.md (local, equal-mass binning recommendation under class imbalance); ADR-006 (calibration-battery row pre-named); ADR-011 Guarantee 6 (validation-only fitting); ADR-022 (multi-seed protocol details, per-(rung, fold, seed) calibrator fitting consistency).
Transcript
See transcripts/2026-05-15__phase-0-04__eval-framework.md for the conversation that led to this decision.