Evaluation design
Deep-dive reference for the methodology in WRITEUP_PAPER.md (academic) and WRITEUP_NARRATIVE.md (narrative). Pick a guide for the cover narrative; this spoke goes deeper.
How to read this spoke: For a fast skim, focus on the bolded Result subsections + the final §Summary if present. For a full audit, read the methodology paragraphs + the ADR references in headers.
- Headline metric battery: PR-AUC (primary; canonical for class-imbalanced ranking) + ROC-AUC (secondary; cross-paper) + recall@FPR at 0.1 %, 1 %, 5 % (1 % is canonical per PromptShield 2025) + ECE + Brier. BCa bootstrap CIs throughout.
- Statistical stance: report effect sizes and CIs. Do not rely on p-values. Modern preference in applied ML evaluation; aligned with eval-toolkit’s primitive design.
- Paired-bootstrap: rung-vs-rung Δ-CIs preserve per-row pairing correlation without the parametric assumptions of DeLong’s or McNemar’s tests. MDE reported alongside any CI that includes zero.
- Per-source breakdowns: mandatory for any OOD claim. The 5-slice OOD slate (notinject/xstest/jbb_behaviors/bipia/injecagent) probes false-positive robustness + cross-distribution + adversarial-elicitation + indirect injection + agentic-flow generalization.
- Single-class slice convention: BIPIA + InjecAgent are all-positive; NotInject is all-negative. AUROC + AUPRC are mathematically undefined on single-class slices; the metrics pipeline filters these out at source per ADR-006.
This spoke covers §5.1, §5.2, §5.4, §5.5 — the evaluation framework, statistical apparatus, per-source breakdown discipline, and OOD slate composition. Threshold policy (§5.3) is its own spoke (threshold-policy.md); adversarial robustness scope (§5.6) lives in the reference-scorer audit spoke (reference-scorer-audit.md) since it co-locates with the threat-model narrative.
This section is the heart of the writeup. Every test below is reported with effect sizes and CIs, never p-values. The choice is methodological: in finite-sample settings, “what is the effect, and how confident is the estimate?” is the answerable question; “is this nonzero at α=0.05?” is a question whose answer depends more on the sample size than on the phenomenon.
5.1 Headline descriptive metrics
The headline metric battery is reported with BCa bootstrap CIs:
- PR-AUC — the most relevant ranking metric for class-imbalanced tasks where precision and recall both matter. F1 alone is misleading at any chosen threshold; PR-AUC integrates over thresholds.
- ROC-AUC — reported alongside for class-prior-independent ranking. Less useful than PR-AUC under this task’s class priors but standard for cross-paper comparison.
- recall@FPR ∈ {0.1 %, 1 %, 5 %}: operational pinpoints. The 1 % point is the canonical reporting threshold (PromptShield 2025). The 0.1 % point is included in evals/metrics/per_cell.parquet per ADR-021 + ADR-023 volatility-surface protocol but is noisy at this project’s sample sizes and not surfaced in headlines.
- ECE (equal-mass + Kumar-2019 debiased) + Brier: calibration; see §5.2 calibration battery below.
See eval-toolkit comparison methodology (see README) for why each metric is preferred over plain F1.
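The recall@FPR pinpoints can be illustrated with a minimal stdlib sketch. The function name `recall_at_fpr` and the toy scores are hypothetical and are not the project's pipeline; the sketch assumes higher scores mean "more likely injection" and that a row is flagged when its score is strictly above the threshold.

```python
import math


def recall_at_fpr(scores, labels, target_fpr):
    """Recall at the loosest threshold whose FPR stays <= target_fpr.

    scores: higher means more likely injection; labels: 1 = injection, 0 = benign.
    Returns (recall, threshold); a row is flagged when score > threshold.
    """
    neg = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    if not neg or not pos:
        raise ValueError("recall@FPR needs both classes")
    # How many benign rows may be flagged without exceeding the FPR budget.
    allowed = math.floor(target_fpr * len(neg) + 1e-9)
    if allowed >= len(neg):
        threshold = -math.inf  # budget covers every negative: flag everything
    else:
        # At most `allowed` negatives score strictly above this value.
        threshold = neg[allowed]
    recall = sum(s > threshold for s in pos) / len(pos)
    return recall, threshold


# Toy data: 10 benign rows and 4 injection rows.
neg_scores = [0.01, 0.02, 0.03, 0.05, 0.08, 0.10, 0.15, 0.20, 0.30, 0.70]
pos_scores = [0.25, 0.60, 0.80, 0.95]
scores = neg_scores + pos_scores
labels = [0] * len(neg_scores) + [1] * len(pos_scores)

print(recall_at_fpr(scores, labels, 0.10))  # one benign row may be flagged
print(recall_at_fpr(scores, labels, 0.001))  # budget rounds down to zero rows
```

With fewer than 1,000 negatives, a 0.1 % budget rounds down to zero allowed false positives, so the threshold is set by the single highest-scoring benign row. That is one way to see why the 0.1 % point is noisy at small sample sizes.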
5.2 Statistical tests
Result (Stance): report effect sizes and CIs. Do not rely on p-values. This is the modern preference in applied ML evaluation and is aligned with eval-toolkit’s primitive design.
Per-metric bootstrap CIs — bootstrap_ci
Why: a point estimate of PR-AUC hides finite-sample variance. Without a CI, claiming rung A beats rung B is irresponsible — the gap may be smaller than the sampling noise. Per-row resampling preserves label distribution and avoids parametric assumptions.
Method: BCa bootstrap (Efron 1987 / Efron & Tibshirani 1993 §14); resample budget per eval-toolkit guidance: n=200 sanity / n=1000 default / n=5000 publication-grade / n=10K+ only for expensive metrics. Pinned seed; stability check at a second seed flags instability if per-fold CI shifts > 0.01. Report the point estimate, not the resample mean — BootstrapCI.point_estimate is the metric on the original data. See eval-toolkit bootstrap methodology (see README).
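The BCa procedure described above can be sketched in plain Python. This is an illustrative implementation under stated assumptions, not eval-toolkit's `bootstrap_ci`: the function name `bca_ci`, its signature, and the toy data are hypothetical, and the statistic is plain accuracy so that every resample is well defined.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def bca_ci(rows, stat, n_resamples=1000, alpha=0.05, seed=0):
    """Return (point_estimate, ci_low, ci_high) using the BCa bootstrap."""
    rng = random.Random(seed)  # pinned seed for reproducibility
    n = len(rows)
    point = stat(rows)  # metric on the original data, not the resample mean

    # Per-row resampling with replacement.
    boots = sorted(
        stat([rows[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )

    # Bias correction z0: where the point estimate sits in the bootstrap
    # distribution. Clamped so inv_cdf never sees exactly 0 or 1.
    frac_below = sum(b < point for b in boots) / n_resamples
    eps = 1.0 / (n_resamples + 1)
    z0 = STD_NORMAL.inv_cdf(min(max(frac_below, eps), 1.0 - eps))

    # Acceleration from leave-one-out (jackknife) estimates.
    jack = [stat(rows[:i] + rows[i + 1:]) for i in range(n)]
    jack_mean = sum(jack) / n
    num = sum((jack_mean - j) ** 3 for j in jack)
    den = 6.0 * sum((jack_mean - j) ** 2 for j in jack) ** 1.5
    accel = num / den if den > 0 else 0.0

    def adjusted_quantile(q):
        z = STD_NORMAL.inv_cdf(q)
        return STD_NORMAL.cdf(z0 + (z0 + z) / (1.0 - accel * (z0 + z)))

    def pick(q):
        return boots[min(n_resamples - 1, max(0, int(q * n_resamples)))]

    low = pick(adjusted_quantile(alpha / 2))
    high = pick(adjusted_quantile(1 - alpha / 2))
    return point, low, high


def accuracy(rows):
    return sum(rows) / len(rows)


# Toy data: 1 = row classified correctly, 0 = row misclassified.
correct = [1] * 170 + [0] * 30
point, low, high = bca_ci(correct, accuracy, n_resamples=1000, seed=0)
print(point, low, high)
```

Rerunning with a second seed and comparing the interval endpoints is a simple form of the stability check mentioned above.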
Paired-bootstrap differences for rung-vs-rung — paired_bootstrap_diff
Why: when two rungs are evaluated on the same test set, their per-row errors are correlated. Paired bootstrap accounts for that correlation without requiring parametric assumptions like DeLong’s. One primitive covers AUC differences, recall@FPR differences, and threshold-based differences uniformly — no need to mix DeLong + McNemar + permutation tests. Non-overlapping CIs imply significance; overlap does NOT imply non-significance — always compute the difference CI.
Method: per-row pairing; matched resamples; CI on the paired Δ. Reported wherever a comparative claim is made. Specialised variants paired_bootstrap_ece_diff (ECE comparisons) and paired_bootstrap_op_point_diff (two-level bootstrap for threshold refitting) handle non-AUC paired metrics. delong_roc_variance is available for sanity-check parametric ROC-AUC CIs (DeLong et al. 1988).
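The pairing idea can be shown with a short sketch: both rungs are scored on the same resampled row indices, and the CI is taken on the difference. This is not the toolkit's `paired_bootstrap_diff`; the helper below is hypothetical, uses a percentile interval rather than BCa for brevity, and uses accuracy at a 0.5 threshold as a stand-in metric.

```python
import random


def accuracy_at_half(labels, scores):
    """Fraction of rows where (score >= 0.5) matches the label."""
    return sum((s >= 0.5) == bool(y) for s, y in zip(scores, labels)) / len(labels)


def paired_diff_ci(metric, labels, scores_a, scores_b,
                   n_resamples=1000, alpha=0.05, seed=0):
    """Percentile CI on metric(A) - metric(B) using matched resamples."""
    rng = random.Random(seed)
    n = len(labels)
    point = metric(labels, scores_a) - metric(labels, scores_b)
    deltas = []
    for _ in range(n_resamples):
        # One index draw per resample, shared by both rungs: this is the pairing.
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        a = [scores_a[i] for i in idx]
        b = [scores_b[i] for i in idx]
        deltas.append(metric(y, a) - metric(y, b))
    deltas.sort()
    low = deltas[int((alpha / 2) * n_resamples)]
    high = deltas[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return point, low, high


# Toy data: rung B errs on 20 rows; rung A errs on a subset of 10 of them,
# so the per-row errors of the two rungs are strongly correlated.
labels = [1] * 50 + [0] * 50
scores_a = [0.1] * 5 + [0.9] * 45 + [0.9] * 5 + [0.1] * 45
scores_b = [0.1] * 10 + [0.9] * 40 + [0.9] * 10 + [0.1] * 40
point, low, high = paired_diff_ci(accuracy_at_half, labels, scores_a, scores_b)
print(point, low, high)
```

Because rung A is right wherever rung B is right in this toy set, every resampled difference is non-negative. Two independently resampled CIs for A and B would not see that structure.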
MDE — mde_from_ci
Why: a wide CI that excludes “no difference” is still informative; a wide CI that includes “no difference” can mean either “the rungs are equivalent” or “the test lacks power to tell”. MDE distinguishes these. A claim of equivalence requires MDE small enough to rule out the smallest meaningful difference.
Method: derive MDE from CI width at α=0.05, power=0.80. Report alongside every CI that includes zero.
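One common way to derive an MDE from a CI is a normal approximation: recover the standard error from the CI half-width, then scale by the sum of the two-sided α quantile and the power quantile. The sketch below assumes a roughly symmetric interval; the function name `mde_normal_approx` is hypothetical, and the toolkit's `mde_from_ci` may use a different derivation.

```python
from statistics import NormalDist


def mde_normal_approx(ci_low, ci_high, alpha=0.05, power=0.80):
    """Minimum detectable effect implied by a CI, under a normal approximation."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = std_normal.inv_cdf(power)          # about 0.84 for power = 0.80
    # Half-width = z_alpha * SE, so SE = width / (2 * z_alpha).
    std_err = (ci_high - ci_low) / (2 * z_alpha)
    return (z_alpha + z_power) * std_err


# A paired difference CI that includes zero.
mde = mde_normal_approx(-0.02, 0.04)
print(round(mde, 4))
```

Read this way, a Δ-CI of [-0.02, 0.04] cannot support an equivalence claim for any difference smaller than roughly 0.043, because the test would miss such a difference more than 20 % of the time.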
Calibration battery
reliability_curve + fit_temperature + fit_isotonic_calibrator + fit_platt_calibrator + fit_beta_calibrator + ECE variants + Brier.
Why: even without a deployment goal, calibration tells you whether the scores mean what they claim. Rows scored 0.9 should turn out to be injections about 90 % of the time. ECE quantifies the gap; Brier is a proper scoring rule that decomposes as BS = REL − RES + UNC (Murphy 1973), so two models with the same Brier score can have very different operational profiles. Reliability curves diagnose where miscalibration concentrates (over-confident on the cleanest? under-confident on the most ambiguous?). Temperature (Guo et al. 2017 ICML; single-parameter logit scaling; argmax-invariant), isotonic, Platt (1999), and Beta scaling are the standard post-hoc repairs, fit on validation only.
ECE choice matters: prefer L2-debiased ECE (Kumar et al. 2019, arXiv:1909.10155) for headline reporting — preserves rank ordering and removes small-sample inflation (expected_calibration_error_l2_debiased). Equal-mass ECE (expected_calibration_error_equal_mass) is more robust under class imbalance via quantile binning (Naeini et al. 2015, arXiv:1411.0760). Pin n_bins across comparisons — ECE is a binned estimator; small bin counts understate, large bin counts overstate.
See eval-toolkit calibration methodology (see README).
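As a concrete reference point, the sketch below computes the Brier score and a plain equal-mass (quantile-binned) ECE in stdlib Python. The function names are hypothetical, and this is the uncorrected L1 estimator, not the Kumar-2019 debiased variant.

```python
def brier_score(probs, labels):
    """Mean squared gap between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def ece_equal_mass(probs, labels, n_bins=10):
    """ECE with quantile bins: each bin holds roughly the same number of rows."""
    pairs = sorted(zip(probs, labels))
    n = len(pairs)
    ece = 0.0
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue
        mean_conf = sum(p for p, _ in chunk) / len(chunk)
        pos_rate = sum(y for _, y in chunk) / len(chunk)
        # Weight each bin's calibration gap by its share of the rows.
        ece += len(chunk) / n * abs(mean_conf - pos_rate)
    return ece


probs = [0.9] * 10 + [0.1] * 10
# Calibrated: 9 of 10 rows scored 0.9 are positive, 1 of 10 scored 0.1.
calibrated = [1] * 9 + [0] + [1] + [0] * 9
# Over-confident: only 6 of 10 rows scored 0.9 are positive, 4 of 10 scored 0.1.
overconfident = [1] * 6 + [0] * 4 + [1] * 4 + [0] * 6

ece_good = ece_equal_mass(probs, calibrated, n_bins=2)
ece_bad = ece_equal_mass(probs, overconfident, n_bins=2)
print(ece_good, ece_bad, brier_score(probs, calibrated))
```

Changing `n_bins` changes the estimate on the same data, which is the reason for pinning it across comparisons.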
CV-CLT CI for cross-fold variance — cv_clt_ci
Why: when source-disjoint k-fold is run as a supplement, per-fold metrics are not independent — train sets overlap across folds. A naive standard-error treatment overstates confidence. CLT-based CI with Bayle 2020 (Annals of Statistics) Theorem 3.1 correction handles the dependence properly. See eval-toolkit splits methodology (see README).
LODO non-exchangeability is a real assumption violation per assumption A-008 — the cv_clt_ci primitive was derived for exchangeable k-fold. Reporting consequence: evals/audit/cross_fold_ci_audit.parquet reports both cv_clt_ci CI and block-bootstrap-on-folds CI per rung; if the ratio block_bootstrap_CI_halfwidth / cv_clt_CI_halfwidth > 1.5, the spoke flags “LODO non-exchangeability dominates within-fold variance” — turning the assumption violation into a named methodology finding.
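The reporting rule above reduces to a half-width ratio check. A minimal sketch follows; the function names and the example intervals are hypothetical, and only the 1.5 threshold comes from the text.

```python
def halfwidth(ci):
    """Half the width of a (low, high) interval."""
    low, high = ci
    return (high - low) / 2.0


def lodo_dominates(cv_clt_interval, block_bootstrap_interval, ratio_threshold=1.5):
    """Return (ratio, flag) for block-bootstrap vs CV-CLT half-widths."""
    ratio = halfwidth(block_bootstrap_interval) / halfwidth(cv_clt_interval)
    return ratio, ratio > ratio_threshold


# Made-up intervals for one rung.
ratio, flagged = lodo_dominates((0.88, 0.92), (0.85, 0.95))
print(ratio, flagged)
```

A ratio well above 1.5 means the fold-level resampling sees much more spread than the CLT interval admits, which is the pattern the spoke names as a finding.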
Multi-comparison correction — bh_fdr_correct
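Why: when many rung-pairs are compared simultaneously, the chance of at least one spurious difference grows with the number of comparisons. Benjamini-Hochberg (BH 1995) controls the false discovery rate rather than the family-wise error rate, and is preferred over Bonferroni for power reasons in correlated-test families.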
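The BH step-up rule itself is short. The sketch below assumes one p-value per rung-pair comparison is already available; how the toolkit's `bh_fdr_correct` obtains or returns them is not shown here, and the function name `bh_reject` and the example p-values are hypothetical.

```python
def bh_reject(p_values, q=0.05):
    """Benjamini-Hochberg step-up: return a reject flag per hypothesis."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k whose sorted p-value is at or below k * q / m.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            k_max = rank
    reject = [False] * m
    # Reject every hypothesis ranked at or below k_max.
    for i in order[:k_max]:
        reject[i] = True
    return reject


p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
bh = bh_reject(p_values, q=0.05)
bonferroni = [p <= 0.05 / len(p_values) for p in p_values]
print(sum(bh), sum(bonferroni))
```

On this example BH rejects two hypotheses where Bonferroni rejects one, which illustrates the power argument.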
Evidence gates — release-time go/no-go
Why: claims at submission need machine-checkable gates, not implicit confidence. eval-toolkit’s claims.md provides composable gates: metric_threshold_gate, low_fpr_feasibility_gate, paired_diff_present_gate, no_leakage_errors_gate, etc. The ClaimSpec → GateResult → ClaimReport pipeline (v0.9+) gives a release-gate manifest a reviewer can audit.
§5.3 (Threshold policy — dual detection + verification) lives in its own spoke at threshold-policy.md per the locked sub-spoke structure in ADR-025.
5.4 Per-source / per-style breakdown
Why: aggregate metrics hide heterogeneity. A 0.95 average PR-AUC can mask a 0.6 PR-AUC on one source that is in fact the source you care about. Per-source breakdowns are mandatory for any OOD claim because OOD is defined by which source the test rows came from.
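A small sketch makes the masking effect concrete. It uses step-wise average precision as the PR-AUC estimator and two made-up sources; the function names and the data are hypothetical and stand in for the project's per-source pipeline.

```python
from collections import defaultdict


def average_precision(labels, scores):
    """Step-wise average precision; assumes at least one positive and no ties."""
    order = sorted(range(len(labels)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank  # precision at each positive's rank
    return total / sum(labels)


def per_source_ap(rows):
    """rows: iterable of (source, label, score). Returns {source: AP}."""
    grouped = defaultdict(lambda: ([], []))
    for source, label, score in rows:
        grouped[source][0].append(label)
        grouped[source][1].append(score)
    return {src: average_precision(y, s) for src, (y, s) in grouped.items()}


rows = [
    ("A", 1, 0.90), ("A", 1, 0.80), ("A", 0, 0.20), ("A", 0, 0.10),
    ("B", 1, 0.40), ("B", 0, 0.85), ("B", 1, 0.30), ("B", 0, 0.70),
]
pooled = average_precision([y for _, y, _ in rows], [s for _, _, s in rows])
by_source = per_source_ap(rows)
print(pooled, by_source)
```

The pooled value sits near 0.73 while source B alone is near 0.42, so the aggregate alone would not reveal that the scorer ranks source B worse than chance-like ordering would suggest.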
The project also ships a per-attack-style heuristic tagger (regex-based; conservative). Tagger coverage on the LODO training pool is not exhaustively measured in this submission: the tagger is used at data-audit time per ADR-041 to spot-check coverage of the four attack-source slates, and the per-row tags are aggregated into per-cell coverage rates in the per-source breakdowns of evals/data_audit.json. See ../EVIDENCE.md §3.
LLM-as-rater rubric audit was originally locked at Phase 0 per ADR-018; DROPPED at Phase 4 cost re-estimation per ADR-050. The 50-pair LLM-pre-labelled dedup-calibration holdout (data/dedup_holdout.jsonl) is the partial LLM-judge audit that survived.
5.5 OOD slate
The 5-slice OOD slate per ADR-021 + ADR-016, populated at Phase 0-04 from docs/research/benchmarks/ candidate set:
| Slice | Source | Class composition | Probe target | Why chosen |
|---|---|---|---|---|
| notinject | HF Hub wikd/NotInject (SHA pinned per source_manifest.yaml) | All-negative (benign-but-injection-like) | False-positive robustness on injection-shaped benign | Tests whether classifier discriminates intent from form |
| xstest | HF Hub paul-rottger/xstest-v2-copy | Both classes (safe/unsafe instructions) | Cross-distribution shift to jailbreak-as-question | Tests against an actively-different distribution from training |
| jbb_behaviors | HF Hub JailbreakBench/JBB-Behaviors | Both classes (harmful behavior elicitations + benign refusal) | Adversarial-elicitation generalization | Canonical jailbreak benchmark; community-recognized |
| bipia | Local git repo (release-pinned SHA in source_manifest.yaml) | All-positive (indirect prompt injection via email body) | Indirect injection generalization | Tests indirect-injection (BIPIA paper benchmark) |
| injecagent | Local git repo (release-pinned SHA in source_manifest.yaml) | All-positive (multi-turn agentic injection) | Agentic-flow generalization | Tests agentic-flow injection |
See eval-toolkit splits methodology (see README) for the source-disjoint discipline this project applies.
Result (Single-class slice convention): BIPIA + InjecAgent are all-positive; NotInject is all-negative. AUROC and AUPRC are mathematically undefined on single-class slices. The metrics pipeline filters these slices out of AUROC / AUPRC artifacts at source (per Item 4 of the v1.0.0 closure sweep — see WRITEUP_PAPER §6.2 Reference scorer contamination and WRITEUP/limitations-and-future-work §8.2); per-slice recall-at-threshold is reported on single-class slices instead.
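AUROC is defined over positive/negative pairs and AUPRC needs at least one positive, so a slice with only one class leaves nothing to rank. The sketch below shows one way such a guard could look. The function names and the report layout are hypothetical, and the flag rate at a fixed threshold is used as the fallback: on an all-positive slice it equals recall, and on an all-negative slice it equals the false-positive rate.

```python
def is_single_class(labels):
    """True when the slice contains only positives or only negatives."""
    return len(set(labels)) < 2


def slice_report(labels, scores, threshold):
    """Per-slice summary that never attempts ranking metrics on one class."""
    flagged = sum(s >= threshold for s in scores)
    return {
        "n_rows": len(labels),
        # Recall on all-positive slices, FPR on all-negative slices.
        "flag_rate_at_threshold": flagged / len(labels),
        "ranking_metrics_defined": not is_single_class(labels),
    }


# Toy all-positive slice, in the style of BIPIA or InjecAgent.
all_positive = slice_report([1, 1, 1, 1], [0.9, 0.7, 0.4, 0.2], threshold=0.5)
# Toy mixed slice, in the style of xstest.
mixed = slice_report([1, 0, 1, 0], [0.9, 0.1, 0.6, 0.3], threshold=0.5)
print(all_positive, mixed)
```

Filtering on `ranking_metrics_defined` before writing AUROC / AUPRC artifacts keeps undefined values out of downstream aggregates instead of relying on NaN handling later.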
Cross-references
- Threshold policy (§5.3 dual-policy detection + verification) → threshold-policy.md
- Reference-scorer contamination audit + adversarial robustness scope → reference-scorer-audit.md
- Data splits + LODO + leakage discipline → data-decisions.md
- Methodology guarantees + library tooling → methodology-guarantees.md
- Headline results (interpretation): WRITEUP_PAPER §4 (academic) or WRITEUP_NARRATIVE Act 3 (narrative)
- Headline tables (data): RESULTS §1
Linked ADRs: ADR-006 (headline metrics + statistical apparatus), ADR-021 (slice aggregation + recall@FPR pinpoints), ADR-022 (paired-bootstrap protocol), ADR-023 (calibration battery), ADR-024 (cross-fold CI methodology), ADR-046 (Phase 4 analysis bundle).