Library-first (per ADR-005, decisions/library_imports.md, and the eval-toolkit v0.39.0 surface): eval_toolkit.bootstrap.bootstrap_ci, together with the project glue at src/eval/marginal_bootstrap.py, has already produced evals/bootstrap/marginal_cells.parquet. This notebook is the READER of that file, not the regenerator. The source of truth lives in evals/.
Per ADR-006 + WRITEUP/eval-design.md §5.1, AUPRC is the primary ranking metric under class imbalance. The random-predictor AUPRC equals the positive prevalence on each slice (NOT 0.5 — that’s AUROC’s chance baseline).
AUROC is reported for cross-paper comparison per ADR-006. AUROC’s chance baseline is 0.5 regardless of prevalence; under class imbalance AUROC can overstate performance relative to AUPRC. Use AUPRC for primary interpretation.
Prevalence baselines (slice-by-slice random-predictor floor for AUPRC)
AUPRC’s random-predictor floor equals the positive-class prevalence on each slice (NOT 0.5). When a rung’s AUPRC falls below that baseline, the rung ranks worse than a random predictor on that slice, which suggests its scores are anti-correlated with the label there.
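The floor can be checked with a short simulation. This is a minimal sketch, assuming a synthetic slice with uniformly random scores. The prevalence of 0.374 is borrowed from the pooled_ood figure purely for illustration, and `average_precision` and `auroc` are hypothetical stdlib helpers written for this example, not eval-toolkit functions.

```python
import random


def average_precision(labels, scores):
    """Step-wise average precision: mean of precision@rank over positives."""
    order = sorted(range(len(labels)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else float("nan")


def auroc(labels, scores):
    """AUROC via the Mann-Whitney rank sum (assumes no tied scores)."""
    order = sorted(range(len(labels)), key=lambda i: scores[i])
    rank_sum = sum(r for r, i in enumerate(order, start=1) if labels[i] == 1)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Synthetic slice (assumption): labels drawn at a fixed prevalence,
# scores drawn independently of the labels, i.e. a random predictor.
rng = random.Random(0)
n = 20000
prevalence = 0.374
labels = [1 if rng.random() < prevalence else 0 for _ in range(n)]
scores = [rng.random() for _ in range(n)]

observed_prevalence = sum(labels) / n
random_ap = average_precision(labels, scores)
random_auroc = auroc(labels, scores)

print(f"prevalence={observed_prevalence:.3f}")
print(f"random AUPRC={random_ap:.3f}  random AUROC={random_auroc:.3f}")
```

Under these assumptions the random predictor's AUPRC should land near the slice prevalence while its AUROC lands near 0.5, which is why the two metrics need different chance baselines.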
None of the rungs clears the pooled_ood prevalence baseline (0.374) under AUPRC. The best is frozen-probe at 0.364; its CI upper bound (0.375) only just reaches the baseline, so even the best rung is not distinguishable from a random predictor on pooled_ood.
LoRA HURTS OOD vs frozen-probe (-0.071 AUPRC; the paired-bootstrap CI excludes zero). Fine-tuning the head on the LODO direct-injection pool actively degrades cross-family OOD performance.
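For readers unfamiliar with the paired bootstrap behind a delta like this, the idea can be sketched as follows. This is an illustrative stdlib version run on synthetic data, not the eval_toolkit.bootstrap.bootstrap_ci API and not the code in 02_frozen_vs_lora.ipynb. The function name `paired_bootstrap_delta`, its parameters, and the percentile-interval choice are all assumptions.

```python
import random


def average_precision(labels, scores):
    """Step-wise average precision: mean of precision@rank over positives."""
    order = sorted(range(len(labels)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else float("nan")


def paired_bootstrap_delta(labels, scores_a, scores_b,
                           n_resamples=500, alpha=0.05, seed=0):
    """Percentile CI for AP(a) - AP(b), resampling the SAME rows for both."""
    rng = random.Random(seed)
    n = len(labels)
    deltas = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # one resample, shared
        y = [labels[i] for i in idx]
        if sum(y) == 0:
            continue  # AP is undefined without positives; skip this draw
        ap_a = average_precision(y, [scores_a[i] for i in idx])
        ap_b = average_precision(y, [scores_b[i] for i in idx])
        deltas.append(ap_a - ap_b)
    deltas.sort()
    lo = deltas[int((alpha / 2) * (len(deltas) - 1))]
    hi = deltas[int((1 - alpha / 2) * (len(deltas) - 1))]
    point = average_precision(labels, scores_a) - average_precision(labels, scores_b)
    return point, lo, hi


# Synthetic data (assumption): model A's scores carry signal, model B's do not.
rng = random.Random(1)
labels = [1 if rng.random() < 0.3 else 0 for _ in range(400)]
scores_a = [2.0 * y + rng.gauss(0.0, 1.0) for y in labels]
scores_b = [rng.random() for _ in labels]

point, lo, hi = paired_bootstrap_delta(labels, scores_a, scores_b)
print(f"delta AP = {point:+.3f}, 95% CI [{lo:+.3f}, {hi:+.3f}]")
```

Because both models are scored on the same resampled rows, per-example difficulty cancels out of the delta. A CI that excludes zero then indicates a difference between the models rather than resampling noise.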
ProtectAI v1 → v2 is non-monotone: v2 beats v1 on jbb_behaviors (+0.037) but loses on xstest (-0.087). The newer version does not improve uniformly across slices.
On in-domain slices (jbb_behaviors + xstest) all trained rungs clear the prevalence baseline; the wall is OOD-specific.
Cross-family framing: training pool is 4 direct-injection sources; the 5-slice OOD slate probes attack types absent from training (indirect injection via BIPIA email-body, multi-turn agentic-flow via InjecAgent, jailbreaks via JBB / XSTest, false-positive-probe via NotInject). The OOD wall is cross-FAMILY, not cross-source.
Reference: see also
RESULTS.md — full 5-rung × 5-slice grid with N/A markers on single-class cells; reviewer-facing figures F1-F5; raw-data blob URLs at tree/v1.0.5/evals/.
02_frozen_vs_lora.ipynb — paired-bootstrap rung-comparison (the source for the LoRA -0.071 delta + CI).
03_calibration.ipynb — reliability + ECE per rung.
04_ood_slate.ipynb — per-slice IID-vs-OOD gap visualization.