Dual-policy threshold characterization at symmetric 1% cost weights
ADR-025: Dual-policy threshold characterization at symmetric 1% cost weights
Status
Accepted (2026-05-15). Closes the last [OPEN] row in SPEC_GREENFIELD §4 Threshold (row 347 — Cost-weight targets). Anchored on pre-locks from ADR-006 (rejection of cost-weighted thresholding plus scenario-based replacement) plus SPEC §4 dual-policy framing plus SPEC §4 dual-policy applicability lock plus ADR-022 per-(seed) threshold protocol plus ADR-021 recall@FPR pinpoint triad.
Context
ADR-006 pre-locked the threshold-handling stance — cost-weighted thresholding rejected as false precision, replaced by qualitative scenario discussion in WRITEUP/threshold-policy.md covering ≥3 deployment scenarios. SPEC §4 then locked the framing — same scores serve two operational contexts (Detection — catch injections, FN-expensive; Verification — confirm clean text, FP-expensive) under different cost weights, with selection on validation only via eval-toolkit’s ThresholdSelector protocol.
What remained genuinely [OPEN] at row 347 was the numeric values for the two policy budgets (FPR target for detection; FNR target for verification), the aggregation surface for selector fitting, the headline-vs-spoke reporting layout, and the protocol for handling per-rung infeasibility when targets are unreachable on a val slice.
eval-toolkit primitives confirmed at decision time (fresh-investigation per CLAUDE.md):
TargetFPRSelector(t_fpr).select(y_val, s_val)— picks smallest threshold s.t. FPR(val) ≤ t_fpr → max recall(val) feasible point (eval_toolkit/thresholds.py#L109)TargetRecallSelector(r_target).select(y_val, s_val)— picks highest threshold s.t. recall(val) ≥ r_target → min FPR(val) feasible point — Lipton-Elkan 2014 §3 most-precise convention (eval_toolkit/thresholds.py#L71)paired_bootstrap_op_point_diff(val_y, val_score_a, val_score_b, test_y, test_score_a, test_score_b, threshold_fn, metric_fn, n_resamples, seed)— two-level bootstrap (refit threshold per val resample, apply on test resample, compute paired diff) — purpose-built for the per-(seed) refit-per-resample protocol (eval_toolkitdocs/methodology/thresholds.md#L181-L218; doc-path no longer resolves on upstream main as of v0.43.0 — methodology consolidated into the package’s docstrings; seepaired_bootstrap_op_point_diffin eval-toolkit for current API)CostSensitiveSelector— explicitly not used per ADR-006 cost-weighted-thresholding rejection
Decision
Q1 — Numeric cost-weight targets (ledger row 347)
Symmetric 1% on both policy budgets.
| Policy | Budget | Selector primitive | Operational reading |
|---|---|---|---|
| Detection | FPR ≤ 1% on validation | TargetFPRSelector(0.01) |
“Catch as many injections as possible while keeping false-alarm rate at or below 1%” |
| Verification | FNR ≤ 1% on validation (equivalently recall ≥ 99%) | TargetRecallSelector(0.99) |
“Flag as little clean text as possible while missing at most 1% of injections” |
Symmetry rationale — (a) the detection-policy operating point coincides numerically with the recall@FPR=1% headline pinpoint locked in ADR-021, so headline integration is free (relabel-in-place footnote, not a new column); (b) symmetric framing keeps the WRITEUP narrative readable as “two budget orientations on a shared 1% cost weight” with no asymmetric-rationale paragraph required; (c) matches eval-toolkit methodology/thresholds.md documented default and SPEC §4 default-if-unsure; (d) PromptShield 2024-2025 precedent uses FPR=1% as the canonical operating point on the detection side.
Q2 — Aggregation surface (per-(rung, fold, seed))
Per-(rung, fold, seed) selector fitting on the validation split. 4 LODO folds × 3 seeds × 2 policies = 24 thresholds per trained rung; 96 threshold-pair instances total across 4 trained rungs. Selection variance is propagated via paired_bootstrap_op_point_diff two-level bootstrap consistent with ADR-022’s per-(seed) threshold protocol for recall@FPR pinpoints.
eval-toolkit methodology/thresholds.md §“When to refit threshold per bootstrap resample” justifies the two-level approach when “threshold-selection rule has meaningful variance (small slices, noisy PR curves) AND you’re reporting operating-point metrics.” Per-fold val n ≈ 250-1300 rows is exactly that regime.
Reference scorers (4 untrained rungs — ProtectAI v1 + v2 + gpt-4o + claude-sonnet-4-6) are excluded from dual-policy fitting per SPEC §4 dual-policy applicability lock; they report recall@FPR pinpoints only with contamination caveats per ADR-018.
Q3 — Reporting layout (1 new headline column + relabel + spoke grid)
Headline footprint per trained rung after Phase 0-05 close:
| Rung | AUPRC | AUROC | R@FPR=0.1%* | R@FPR=1%† | R@FPR=5% | FPR@R≥99%† | ECE | Brier |
Where: - * (carried over from ADR-021) — volatility-flagged 0.1% pinpoint at pooled aggregation only - † (new in ADR-025) — dual-policy operating points (detection on R@FPR=1%; verification on FPR@R≥99%)
The detection-policy column collapses to the existing recall@FPR=1% pinpoint (no new data, just a footnote labeling it as the detection-policy operating point per ADR-025); the verification-policy column is one new column (“FPR @ recall ≥ 99%”) per trained rung — total 4 new headline cells (one per trained rung × 1 column). Reference rungs receive blank cells in this column with footnote pointing to the dual-policy applicability lock.
Spoke at WRITEUP/threshold-policy.md carries:
- Dual-policy framing explainer (Detection vs Verification budgets; same scores, two anchor operating points along the ROC curve)
- Dual-policy operating-point grid — 4 trained rungs × 2 policies × {pooled-IID + pooled-OOD + 4 per-LODO-fold + 5 per-OOD-slice} aggregation levels = 80 cells per policy, with paired_bootstrap_op_point_diff CIs at every cell
- Verification-target reachability across trained rungs subsection (Q4 surface)
- ≥3 deployment scenarios per ADR-006 (e.g., agentic tool-use with catastrophic-miss cost; user-facing chat with friction-dominant cost; throughput-dominant triage)
- Reference-scorer caveat — dual-policy doesn’t apply per SPEC §4 lock; reference rungs report recall@FPR pinpoints only
- Recall-floor sensitivity sweep afterword (optional Phase 5 deliverable; pre-committed via Q4 persistence pre-commit; covered in Consequences below)
Q4 — Infeasibility handling (honest reporting + audit JSON + spoke subsection + persistence pre-commit)
When TargetRecallSelector(0.99) cannot satisfy recall ≥ 99% on a (rung, fold, seed) val slice (PR-curve too noisy, plateau below 99%, or score quantization gaps), the reporting protocol pre-commits four surfaces:
| Surface | Implementation |
|---|---|
| Cell rendering | Asterisk (*) on cells where target unreachable |
| Footnote | “Cells marked * indicate the verification target (recall ≥ 99%) was unreachable on the val slice; threshold falls back to the lowest available; achieved val-recall is reported in evals/audit/verification_reachability.json” |
| Audit JSON | evals/audit/verification_reachability.json with per-(rung, fold, seed) entries containing target_reachable (bool), target_recall (0.99), achieved_val_recall (float), fallback_threshold (float), fallback_test_fpr (float) |
| Spoke subsection | “Verification-target reachability across trained rungs” — surfaces per-rung reachability rate as a cross-rung comparison artifact |
Audit JSON schema:
{
"<rung_id>": {
"<fold_id>": {
"<seed>": {
"target_reachable": false,
"target_recall": 0.99,
"achieved_val_recall": 0.974,
"fallback_threshold": 0.412,
"fallback_test_fpr": 0.083
}
}
}
}Persistence pre-commit for post-hoc recall-floor sweeps. ADR-013 already locks per-row val + test prediction persistence per (rung, fold, seed). Switching from recall ≥ 99% to recall ≥ 95% (or any other floor) is a re-run of TargetRecallSelector(t').select(y_val, s_val) on the persisted predictions, with no retraining, no re-fitting calibrators, no new GPU time. Pre-bootstrap CIs at the alternative floor regenerate via the existing paired_bootstrap_op_point_diff orchestrator-layer joblib pipeline per ADR-022. The “Recall-floor sensitivity sweep” becomes a one-commit afterword in WRITEUP/threshold-policy.md if needed during Phase 5 review — pre-committed sweep grid is recall floors {95%, 99%, 99.9%} (mirror of ADR-021’s recall@FPR pinpoint triad in FNR space).
Consequences
Positive
- Closes the last open ledger row in §4 Threshold; SPEC §4 fully locked at row level
- Aligns with ADR-006 cost-weighted-thresholding rejection (no
CostSensitiveSelector); dual-policy is two anchor budgets along the ROC curve, not a Bayes-optimal cost derivation - Detection-policy column reuses the ADR-021 recall@FPR=1% pinpoint — zero new headline data on the detection side; +1 verification column total
- Symmetric 1% targets keep the WRITEUP narrative readable; no asymmetric-rationale paragraph
- Per-(rung, fold, seed) aggregation maintains methodology coherence with ADR-022 (same per-(seed) threshold protocol as recall@FPR pinpoints)
- Honest infeasibility reporting via asterisk + audit JSON mirrors ADR-021’s recall@FPR=0.1% volatility-surfacing pattern; treats verification reachability rate as a methodology contribution rather than a hidden caveat
- Persistence pre-commit makes recall-floor sensitivity sweeps a free post-hoc artifact; protects against a Phase 5 reviewer asking “what about 95%?” without requiring rework
- Library-first preserved end-to-end:
TargetFPRSelector+TargetRecallSelector+paired_bootstrap_op_point_diff+metrics_at_thresholdall shipped in eval-toolkit; project-specific glue is the per-(rung, fold, seed) orchestration loop and the audit-JSON emission layer
Negative
- Headline table widens by one column on trained rungs (+1 column × 4 trained rungs); spoke gains a new
WRITEUP/threshold-policy.mddeliverable with 80 cells × 2 policies plus reachability subsection plus deployment scenarios - Reference-scorer rows in headline table carry blank cells under the verification column with footnote — slight asymmetry between trained-rung and reference-rung headline rows
- 96 paired_bootstrap_op_point_diff invocations during Phase 4 bootstrap battery (4 trained rungs × 24 threshold pairs); orchestrator-layer joblib parallelization on 64-core Threadripper per ADR-022 absorbs the cost
- Verification-target reachability is rung-dependent and not guaranteed (see assumption A-009); some rungs may carry mostly asterisks under the verification column
Neutral
- ≥3 deployment scenarios from ADR-006 are co-located in
WRITEUP/threshold-policy.mdwith the dual-policy operating-point grid — single spoke artifact - Cost-weighted thresholding stays rejected; dual-policy framing operates entirely on the FPR / recall axis, not on a fabricated cost ratio
- Detection-policy headline column relabel (footnote on existing recall@FPR=1% cell) is reversible — if a reviewer challenges the relabel, splitting it into a duplicated standalone column is one commit with zero data movement
Alternatives considered
- Symmetric 5% targets — looser detection budget; rejected because 5% FNR verifier is operationally weak (misses 1/20 injections) and 5% FPR detector floods alerts beyond the canonical PromptShield precedent
- Asymmetric (1% / 0.1%) — verification stricter to mirror “verification is the precision-biased policy” deployment intuition; rejected because asymmetry needs a justification paragraph and 99.9% recall floor is unreachable on most rungs
- Symmetric triad (mirror of recall@FPR triad) — full ROC-curve characterization at 6 pinpoints; rejected because detection-side triad is already locked in ADR-021 (numerical redundancy) and verification-side triad multiplies headline width without proportional methodology gain — kept as an afterword sweep via the Q4 persistence pre-commit
- No new headline column (full dual-policy table only in spoke) — minimal headline footprint; rejected because the verification-policy column is genuinely new headline information (FPR @ recall ≥ 99% is not in ADR-021’s pinpoints) and hiding it from A1 reviewer defeats the dual-policy framing purpose
- Step-down sequence on infeasibility (try 99% → 95% → max-achievable) — apparent flexibility; rejected because cell-by-cell tier labels make paired-Δ comparisons across rungs incoherent (same column carries different operating points)
- Drop the cell on infeasibility — apparent simplicity; rejected because aggregating only feasible-rung instances introduces selection bias in pooled aggregations
- Per-rung pooled-validation fitting (8 thresholds total instead of 96) — simpler; rejected because mismatches ADR-022’s per-(seed) protocol and prevents
paired_bootstrap_op_point_difftwo-level CI propagation
Phase 1 deliverables
WRITEUP/threshold-policy.mdspoke filename pre-committed (already on the spoke list per ADR-006); now extended to carry the dual-policy operating-point grid + verification-reachability subsection + recall-floor sensitivity sweep afterwordevals/audit/verification_reachability.jsonschema documented inevals/audit/README.md- Verify
TargetRecallSelectorsource-level unreachable-target behavior at Phase 4 entry (one library check; if the selector raises rather than falls back to lowest threshold, glue wraps in try/except and emits the audit row from the exception path)
References
See frontmatter references list. Primary anchors — eval-toolkit methodology/thresholds.md (selector semantics + two-level bootstrap rationale); Lipton-Elkan-Naryanaswamy 2014 (TargetRecallSelector most-precise convention); PromptShield 2024-2025 (FPR=1% canonical operating point); InjecGuard 2024 (over-defense FPR-floor framing); Elkan 2001 (cost-sensitive learning theory we are not applying); ADR-006 cost-weighted-thresholding rejection; ADR-013 per-row prediction persistence; ADR-018 reference-scorer slate plus contamination caveats; ADR-021 recall@FPR pinpoint triad plus volatility-surfacing precedent; ADR-022 per-(seed) threshold protocol plus paired-bootstrap orchestration; ADR-005 Principle 2 (honest evaluation preferred).
Transcript
See transcripts/2026-05-15__phase-0-05__threshold-policy.md for the Phase 0-05 conversation that led to this decision.