Per ADR-023, the calibration audit reports:

- ECE equal-mass (10 bins; per-cell)
- Brier score (per-cell)
- Reliability curve (rendered as F4 reliability triptych in docs/plots/F4.svg; see RESULTS §4)
- Temperature + isotonic fitting (validation-only per ADR-023)
Platt + Beta calibrators were deferred per ADR-023 original scope. v1.0.6 filed eval-toolkit#43 (library-first); v1.0.8 will consume upstream when shipped. See NEXT_STEPS §1.4.
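For orientation, validation-only temperature fitting can be sketched as follows. This is a minimal illustration, not the project's calibrator: it assumes binary logits, uses a plain grid search over the temperature, and the names `sigmoid`, `binary_nll`, `fit_temperature`, and `apply_temperature` are hypothetical. Isotonic fitting is not shown. The data in the demo is synthetic.

```python
import numpy as np


def sigmoid(z):
    # Numerically stable logistic function via tanh.
    return 0.5 * (1.0 + np.tanh(0.5 * np.asarray(z, dtype=float)))


def binary_nll(logits, labels, temperature):
    # Mean binary cross-entropy of sigmoid(logits / T):
    # log(1 + exp(z)) - y * z, computed stably with logaddexp.
    z = np.asarray(logits, dtype=float) / temperature
    y = np.asarray(labels, dtype=float)
    return float(np.mean(np.logaddexp(0.0, z) - y * z))


def fit_temperature(val_logits, val_labels, grid=None):
    # Pick the temperature that minimises NLL on the validation split only.
    # Assumption: a log-spaced grid is fine enough for illustration.
    if grid is None:
        grid = np.exp(np.linspace(-3.0, 3.0, 601))
    losses = [binary_nll(val_logits, val_labels, t) for t in grid]
    return float(grid[int(np.argmin(losses))])


def apply_temperature(logits, temperature):
    # Rescale held-out logits with the temperature fitted on validation data.
    return sigmoid(np.asarray(logits, dtype=float) / temperature)


# Synthetic demo: logits that are three times too confident.
rng = np.random.default_rng(0)
true_logits = rng.normal(0.0, 2.0, 5000)
labels = (rng.random(5000) < sigmoid(true_logits)).astype(float)
val_logits = 3.0 * true_logits

temperature = fit_temperature(val_logits, labels)
print(f"fitted temperature: {temperature:.2f}")
```

The key design point is that the temperature is chosen on validation data and then frozen, so test-split metrics are never used to tune the calibrator.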
This notebook aggregates the per-cell ECE + Brier from evals/metrics/per_cell.parquet and surfaces the per-rung calibration story.
from pathlib import Path

import pandas as pd

# Walk up from the working directory until the repo root (pyproject.toml) is found.
REPO_ROOT = Path.cwd().resolve()
while not (REPO_ROOT / "pyproject.toml").exists() and REPO_ROOT != REPO_ROOT.parent:
    REPO_ROOT = REPO_ROOT.parent

PER_CELL_PARQUET = REPO_ROOT / "evals" / "metrics" / "per_cell.parquet"
F4_FIGURE = REPO_ROOT / "docs" / "plots" / "F4.svg"

per_cell = pd.read_parquet(PER_CELL_PARQUET)
print(f"per_cell.parquet: {per_cell.shape}")
print(f"calibration columns: {[c for c in per_cell.columns if 'ece' in c or 'brier' in c]}")
Per WRITEUP/eval-design.md §5.1 + WRITEUP/methodology-guarantees.md:
- ECE equal-mass is the headline calibration metric (10 equal-mass bins per ADR-023; debiased variants are surfaced in calibration_battery.py, but only equal-mass is exported to per_cell.parquet).
- Brier is reported as a strictly proper scoring rule sanity check; lower is better.
- Reliability curves are rendered per rung in F4 (triptych): raw, temperature-scaled, and isotonic-fitted reliability diagrams. See docs/plots/F4.svg + RESULTS §4.
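The two headline metrics above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation in calibration_battery.py: it treats the binary case with positive-class probabilities, forms equal-mass bins by sorting and splitting into near-equal chunks, and the function names `brier_score` and `ece_equal_mass` are hypothetical.

```python
import numpy as np


def brier_score(y_true, p_pos):
    # Mean squared error between predicted probability and the 0/1 outcome.
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(p_pos, dtype=float)
    return float(np.mean((p - y) ** 2))


def ece_equal_mass(y_true, p_pos, n_bins=10):
    # Expected calibration error with equal-mass (quantile) bins:
    # sort by predicted probability, split into n_bins near-equal chunks,
    # and average |observed rate - mean prediction| weighted by bin size.
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(p_pos, dtype=float)
    order = np.argsort(p, kind="stable")
    n = len(p)
    ece = 0.0
    for idx in np.array_split(order, n_bins):
        if len(idx) == 0:
            continue  # more bins than samples
        gap = abs(y[idx].mean() - p[idx].mean())
        ece += (len(idx) / n) * gap
    return float(ece)


# Toy usage with made-up predictions.
y_demo = [0, 0, 1, 1, 1, 0, 1, 1]
p_demo = [0.1, 0.3, 0.6, 0.8, 0.9, 0.4, 0.7, 0.95]
print("Brier:", round(brier_score(y_demo, p_demo), 4))
print("ECE (4 equal-mass bins):", round(ece_equal_mass(y_demo, p_demo, n_bins=4), 4))
```

Equal-mass binning keeps roughly the same number of samples in every bin, so sparse confidence regions do not produce noisy per-bin estimates the way equal-width bins can.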
Calibration trend by rung:
print("Per-rung mean calibration (averaged across multi-class slices):")
rung_means = (
    calib_agg.groupby("rung")
    .agg(
        mean_ece=("ece_mean", "mean"),
        mean_brier=("brier_mean", "mean"),
    )
    .reindex(RUNG_ORDER)
    .round(4)
)
print(rung_means.to_string())
print()
print("Verification: lower-is-better; protectai rungs have higher ECE/Brier in cells with sample sizes ≥1, since they emit non-calibrated logits without per-fold val tuning.")
Verification: lower-is-better; protectai rungs have higher ECE/Brier in cells with sample sizes ≥1, since they emit non-calibrated logits without per-fold val tuning.
F4 figure reference
print(f"F4 reliability triptych SVG: {F4_FIGURE}")
print(f" Exists: {F4_FIGURE.exists()}")
print(f" Size: {F4_FIGURE.stat().st_size if F4_FIGURE.exists() else '-'} bytes")
print(" See RESULTS.md §4 for the embedded version + WRITEUP/eval-design.md §5.1 for the methodology.")
F4 reliability triptych SVG: /home/brandon_behring/Claude/prompt-injection-detection-submission/docs/plots/F4.svg
Exists: True
Size: 95609 bytes
See RESULTS.md §4 for the embedded version + WRITEUP/eval-design.md §5.1 for the methodology.