# Threshold selection

> **Background** *(skip if you've internalized this)*. A binary classifier
> outputs a real-valued score per row; a *threshold* turns that score
> into a 0/1 prediction. The threshold is a free parameter — different
> thresholds give different (precision, recall, FPR) trade-offs along
> the model's PR / ROC curves. Picking the threshold is therefore an
> *operational* choice (downstream cost of FP vs FN, target rate, etc.)
> that the model itself doesn't determine. The classic mistake is
> reporting metrics at the toolkit's default threshold (often 0.5) when
> the deployment will use a tuned operating point.

This chapter covers the six reference
[`ThresholdSelector`](../api/thresholds.md) implementations, the
Bayes-optimal derivation, and the subtle question of when to refit the
threshold per bootstrap resample.

## Setup

```python
import numpy as np
from eval_toolkit import metrics_at_threshold, select_threshold
```

A 200-row informative fixture used throughout:

```python
rng = np.random.default_rng(42)
y = rng.binomial(1, 0.3, size=200).astype(int)
s = np.clip(y * 0.6 + rng.normal(0, 0.25, 200), 0, 1)
```
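
To make the trade-off from the background note concrete, here is a minimal dependency-free sketch (not the toolkit's implementation) that sweeps three thresholds over a tiny hand-made fixture. The helper names `confusion` and `rates`, the `score >= threshold` decision rule, and the `y_demo` / `s_demo` values are all assumptions chosen for illustration.

```python
def confusion(y_true, scores, threshold):
    """Count TP/FP/FN/TN for the rule `score >= threshold` (an assumed convention)."""
    tp = fp = fn = tn = 0
    for label, score in zip(y_true, scores):
        predicted = score >= threshold
        if predicted and label == 1:
            tp += 1
        elif predicted:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def rates(y_true, scores, threshold):
    """Return (precision, recall, fpr); a rate with an empty denominator is 0.0."""
    tp, fp, fn, tn = confusion(y_true, scores, threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return precision, recall, fpr


y_demo = [0, 0, 1, 0, 1, 1, 0, 1]
s_demo = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9]

for t in (0.25, 0.50, 0.75):
    p, r, f = rates(y_demo, s_demo, t)
    print(f"t={t:.2f}  precision={p:.2f}  recall={r:.2f}  fpr={f:.2f}")
```

Raising the threshold from 0.25 to 0.75 on this fixture moves recall from 1.00 down to 0.25 while FPR falls from 0.50 to 0.00, which is the trade-off each selector below resolves in a different way.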

(migration)=
## v0.7.0 BREAKING — `select_threshold` now takes a Selector instance
The v0.6 string form is **removed**. Every call site passes a
[`ThresholdSelector`](../api/thresholds.md) instance.

| v0.6 | v0.7 |
|---|---|
| `criterion="max_f1"` | `MaxF1Selector()` |
| `criterion="recall_0.90"` | `TargetRecallSelector(0.90)` |
| `criterion="recall_0.95"` | `TargetRecallSelector(0.95)` |
| `criterion="precision@0.90"` | `TargetPrecisionSelector(0.90)` *(new)* |
| `criterion="recall@0.90"` | `TargetRecallSelector(0.90)` |

Passing a string raises `TypeError` with the migration mapping in the
message. See `CHANGELOG.md` for the rationale.
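
When auditing call sites, the table above can be encoded as a small lookup. The sketch below is a hypothetical helper (`v07_selector_expr` is not part of eval-toolkit) that returns the v0.7 selector expression as source text for each v0.6 string listed in the table; it assumes the numeric target is written as a decimal such as `0.90`.

```python
import re


def v07_selector_expr(criterion: str) -> str:
    """Translate a v0.6 criterion string into the v0.7 selector expression (as text)."""
    if criterion == "max_f1":
        return "MaxF1Selector()"
    # Spellings listed in the migration table: recall_X, recall@X, precision@X.
    match = re.fullmatch(r"(recall[_@]|precision@)(\d+\.\d+)", criterion)
    if match is None:
        raise ValueError(f"no v0.7 mapping known for criterion={criterion!r}")
    prefix, target = match.groups()
    cls = "TargetRecallSelector" if prefix.startswith("recall") else "TargetPrecisionSelector"
    return f"{cls}({target})"


for old in ("max_f1", "recall_0.90", "recall@0.95", "precision@0.90"):
    print(f'criterion="{old}"  ->  {v07_selector_expr(old)}')
```

Returning text rather than a selector instance keeps the helper usable in a grep-and-replace pass without importing the toolkit.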

(selectors)=
## Selectors
(max-f1)=
### Max-F1
**When to use.** Default for binary classification when costs are
unknown / symmetric / "we want a balanced operating point". Especially
common as a *reporting* number — pair the model's max-F1 threshold with
its threshold-free metrics (PR-AUC, ROC-AUC) for a complete picture.

**Optimality.** Lipton, Elkan & Narayanaswamy
([2014](https://arxiv.org/abs/1402.1892)) prove that the max-F1
threshold over the empirical PR curve is F1-optimal among all
constant decision rules, under the labeling distribution observed in
the eval slice.

```python
from eval_toolkit import MaxF1Selector

result = MaxF1Selector().select(y, s)
print(f"max-F1 threshold={result.threshold:.3f}  F1={result.f1:.3f}")
```
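
Conceptually, max-F1 selection is an exhaustive scan over candidate thresholds. The following is a minimal sketch of that idea, not `MaxF1Selector`'s actual code: the function names, the `score >= threshold` rule, the use of distinct observed scores as candidates, and the tie-breaking toward the lowest threshold are all assumptions.

```python
def f1_at(y_true, scores, threshold):
    """F1 for the rule `score >= threshold` (assumed); 0.0 without true positives."""
    pairs = list(zip(y_true, scores))
    tp = sum(1 for label, score in pairs if score >= threshold and label == 1)
    fp = sum(1 for label, score in pairs if score >= threshold and label == 0)
    fn = sum(1 for label, score in pairs if score < threshold and label == 1)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def max_f1_threshold(y_true, scores):
    """Try each distinct score as the threshold and return the one with the best F1."""
    candidates = sorted(set(scores))
    # max() keeps the first maximal item, so ties resolve to the lowest threshold here.
    return max(candidates, key=lambda t: f1_at(y_true, scores, t))


y_demo = [0, 0, 1, 0, 1, 1, 0, 1]
s_demo = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9]

best = max_f1_threshold(y_demo, s_demo)
print(f"threshold={best:.2f}  F1={f1_at(y_demo, s_demo, best):.3f}")  # threshold=0.30  F1=0.800
```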

(target-recall)=
### Target recall
**When to use.** Operationally, "find ≥X % of positives". Replaces the
`recall_0.90` / `recall_0.95` strings from v0.6.

**Semantics.** Picks the **highest threshold** meeting recall ≥ target,
i.e., the *most-precise* operating point still satisfying the floor.
Recall is monotonically non-increasing in threshold, so the thresholds
that meet the floor form a contiguous block and we pick its rightmost
member (lowest feasible recall, fewest false positives, and usually the
highest precision, although precision itself is not strictly monotonic
in threshold).

```python
from eval_toolkit import TargetRecallSelector

result = TargetRecallSelector(0.90).select(y, s)
print(f"recall>=0.90 threshold={result.threshold:.3f}  recall={result.recall:.3f}")
```
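
The "rightmost feasible threshold" rule can be sketched in a few lines. This is an illustrative stand-in rather than the selector's implementation; `recall_at`, `highest_threshold_for_recall`, the `score >= threshold` rule, and the demo data are assumptions.

```python
def recall_at(y_true, scores, threshold):
    """Recall for the rule `score >= threshold` (assumed convention)."""
    positives = sum(y_true)
    if positives == 0:
        raise ValueError("recall is undefined without positive labels")
    tp = sum(1 for label, score in zip(y_true, scores) if label == 1 and score >= threshold)
    return tp / positives


def highest_threshold_for_recall(y_true, scores, target):
    """Largest candidate threshold (a distinct score) whose recall meets the floor."""
    feasible = [t for t in sorted(set(scores)) if recall_at(y_true, scores, t) >= target]
    if not feasible:
        raise ValueError("no threshold meets the recall target")
    return feasible[-1]


y_demo = [0, 0, 1, 0, 1, 1, 0, 1]
s_demo = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9]

t = highest_threshold_for_recall(y_demo, s_demo, 0.75)
print(f"threshold={t:.2f}  recall={recall_at(y_demo, s_demo, t):.2f}")  # threshold=0.50  recall=0.75
```

On this fixture every threshold from 0.1 through 0.5 meets the 0.75 floor, and the rule returns the right edge of that block.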

(target-precision)=
### Target precision (NEW in v0.7.0)
**When to use.** Operationally, "at least X % of positive predictions
must be correct", e.g., a moderation queue where each FP costs human
attention. Required by `prompt-injection-sdd`'s `"precision@0.90"`
workflow, which previously forced a local re-implementation of
`select_threshold`.

**Semantics.** Picks the **smallest threshold** meeting precision ≥
target — the *highest-recall* operating point still satisfying the
floor. Precision is *not* monotonic in threshold, so the full PR curve
is scanned.

```python
from eval_toolkit import TargetPrecisionSelector

result = TargetPrecisionSelector(0.50).select(y, s)
print(f"precision>=0.50 threshold={result.threshold:.3f}  precision={result.precision:.3f}")
```
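
Because precision can dip and recover as the threshold rises, the feasible thresholds need not be contiguous. The sketch below illustrates this on a hand-made fixture; it is not `TargetPrecisionSelector`'s code, and the function names, the `score >= threshold` rule, and the data are assumptions.

```python
def precision_at(y_true, scores, threshold):
    """Precision for `score >= threshold`; None when nothing is predicted positive."""
    predicted = [label for label, score in zip(y_true, scores) if score >= threshold]
    return sum(predicted) / len(predicted) if predicted else None


def feasible_thresholds(y_true, scores, target):
    """All candidate thresholds (distinct scores) whose precision meets the floor."""
    out = []
    for t in sorted(set(scores)):
        p = precision_at(y_true, scores, t)
        if p is not None and p >= target:
            out.append(t)
    return out


def smallest_threshold_for_precision(y_true, scores, target):
    """Scan the whole curve and return the lowest feasible threshold."""
    feasible = feasible_thresholds(y_true, scores, target)
    if not feasible:
        raise ValueError("no threshold meets the precision target")
    return feasible[0]


y_demo = [0, 0, 1, 0, 1, 1, 0, 1]
s_demo = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9]

print(feasible_thresholds(y_demo, s_demo, 0.65))               # [0.3, 0.5, 0.6, 0.9]
print(smallest_threshold_for_precision(y_demo, s_demo, 0.65))  # 0.3
```

Thresholds 0.4 and 0.7 fail the 0.65 floor even though their neighbours pass, so stopping at the first infeasible threshold while scanning from either end can miss valid operating points.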

(target-fpr)=
### Target FPR
**When to use.** Operationally, "alarm-rate must be ≤X %" — e.g., a
classifier feeding a paged queue where false alarms cost on-call
minutes.

**Semantics.** Smallest threshold meeting FPR ≤ target on the ROC
curve — i.e., the *highest-TPR* feasible point.

```python
from eval_toolkit import TargetFPRSelector

result = TargetFPRSelector(0.10).select(y, s)
fpr = metrics_at_threshold(y, s, result.threshold)["fpr"]
print(f"FPR<=0.10 threshold={result.threshold:.3f}  fpr={fpr:.3f}")
```
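
FPR is non-increasing in threshold, so an ascending scan can stop at the first threshold that satisfies the cap. A minimal sketch under that assumption follows; the helper names, the `score >= threshold` rule, and the requirement that both classes are present are illustrative choices, not the toolkit's behaviour.

```python
def fpr_tpr_at(y_true, scores, threshold):
    """Return (FPR, TPR) for `score >= threshold`; assumes both classes are present."""
    neg = [score for label, score in zip(y_true, scores) if label == 0]
    pos = [score for label, score in zip(y_true, scores) if label == 1]
    fpr = sum(score >= threshold for score in neg) / len(neg)
    tpr = sum(score >= threshold for score in pos) / len(pos)
    return fpr, tpr


def smallest_threshold_for_fpr(y_true, scores, target):
    """Lowest candidate threshold (a distinct score) with FPR <= target."""
    for t in sorted(set(scores)):  # ascending, so the first hit is the smallest
        if fpr_tpr_at(y_true, scores, t)[0] <= target:
            return t
    raise ValueError("no observed score meets the FPR target")


y_demo = [0, 0, 1, 0, 1, 1, 0, 1]
s_demo = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9]

t = smallest_threshold_for_fpr(y_demo, s_demo, 0.25)
fpr, tpr = fpr_tpr_at(y_demo, s_demo, t)
print(f"threshold={t:.2f}  fpr={fpr:.2f}  tpr={tpr:.2f}")  # threshold=0.50  fpr=0.25  tpr=0.75
```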

(youden)=
### Youden's J
**When to use.** When you want a single "balanced" operating point and
costs are unknown. A parameter-free criterion: it takes no target and
picks the threshold that maximizes J = TPR − FPR.

**Reference.** Youden, W. J. *Index for rating diagnostic tests.*
Cancer 3(1), 1950.

```python
from eval_toolkit import YoudenJSelector

result = YoudenJSelector().select(y, s)
print(f"Youden-J threshold={result.threshold:.3f}")
```
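
As a rough illustration of the criterion, the sketch below evaluates J = TPR − FPR at every distinct score and keeps the maximizer. It is not `YoudenJSelector`'s implementation; the names, the `score >= threshold` rule, and the fixture (chosen so the maximum is unique) are assumptions.

```python
def youden_j(y_true, scores, threshold):
    """J = TPR - FPR for `score >= threshold`; assumes both classes are present."""
    pos = [score for label, score in zip(y_true, scores) if label == 1]
    neg = [score for label, score in zip(y_true, scores) if label == 0]
    tpr = sum(score >= threshold for score in pos) / len(pos)
    fpr = sum(score >= threshold for score in neg) / len(neg)
    return tpr - fpr


def youden_threshold(y_true, scores):
    """Distinct score that maximizes J (ties would resolve to the lowest threshold)."""
    return max(sorted(set(scores)), key=lambda t: youden_j(y_true, scores, t))


y_demo = [0, 0, 0, 1, 1, 0, 1, 1]
s_demo = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9]

t = youden_threshold(y_demo, s_demo)
print(f"threshold={t:.2f}  J={youden_j(y_demo, s_demo, t):.2f}")  # threshold=0.40  J=0.75
```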

(cost-sensitive)=
### Cost-sensitive (Bayes-optimal)
**When to use.** When you have explicit FP / FN costs and a deployment
prior π = P(y=1). The closed-form Bayes-optimal threshold from Elkan
([2001](https://www.cs.iastate.edu/~honavar/elkan.pdf)) §4 is

$$
t^* = \frac{c_\text{fp}\,(1-\pi)}{c_\text{fp}\,(1-\pi) + c_\text{fn}\,\pi}.
$$

**Important caveat.** Bayes-optimal assumes **calibrated probability
scores** in [0, 1]. For raw logits, apply a calibrator first
([calibration.md](calibration.md): `fit_temperature`,
`fit_isotonic_calibrator`, `fit_platt_calibrator`).

```python
from eval_toolkit import CostMatrix, CostSensitiveSelector

cm = CostMatrix(prior=0.30, fp_cost=1.0, fn_cost=2.0)
result = CostSensitiveSelector(cm).select(y, s)
print(f"cost-sensitive threshold={result.threshold:.3f}  (Bayes-optimal)")
```

The threshold is determined entirely by `(prior, fp_cost, fn_cost)` —
the data is only used to compute precision / recall / F1 *at* that
threshold for reporting. This is intentional: Bayes-optimal is a
*derivation*, not an empirical search.
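
Since no data enters the derivation, the closed form for $t^*$ can be checked by hand. The sketch below simply evaluates that formula; `bayes_threshold` is a hypothetical helper, and whether `CostSensitiveSelector` returns exactly this number depends on its implementation.

```python
def bayes_threshold(prior: float, fp_cost: float, fn_cost: float) -> float:
    """Evaluate t* = c_fp (1 - pi) / (c_fp (1 - pi) + c_fn pi)."""
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    if fp_cost <= 0 or fn_cost <= 0:
        raise ValueError("costs must be positive")
    neg_term = fp_cost * (1.0 - prior)  # expected cost weight of a false positive
    pos_term = fn_cost * prior          # expected cost weight of a false negative
    return neg_term / (neg_term + pos_term)


print(f"{bayes_threshold(0.30, 1.0, 2.0):.3f}")  # CostMatrix values used above -> 0.538
print(f"{bayes_threshold(0.50, 1.0, 1.0):.3f}")  # balanced prior, equal costs  -> 0.500
```

With `prior=0.30`, `fp_cost=1.0`, `fn_cost=2.0` the arithmetic is 0.7 / (0.7 + 0.6) = 0.7 / 1.3 ≈ 0.538.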

(bootstrap-refit)=
## When to refit threshold per bootstrap resample
A subtle methodological choice: when reporting a CI on F1 (or any
threshold-dependent metric), should the threshold be fixed once on the
full data, or refit on each bootstrap resample?

**Fix once (single-level bootstrap).** Use
[`bootstrap_ci`](../api/bootstrap.md) wrapping a
function that takes `(y_true, y_score)` and uses a pre-fit threshold.
The CI captures uncertainty in the *metric at this fixed threshold*.

**Refit per resample (two-level bootstrap).** Use
[`paired_bootstrap_op_point_diff`](../api/bootstrap.md)
which re-runs the threshold selection on each resample, then computes
the metric. The CI captures *both* the metric uncertainty and the
threshold-selection uncertainty.

**Which to use.** Refit per resample when the threshold-selection rule
has meaningful variance (small slices, noisy PR curves) AND you're
reporting *operating-point* metrics (F1, precision, recall) AND the
operational deployment will re-run threshold selection on each new data
batch. Otherwise fix once — simpler, narrower CI, easier to interpret.

The two-level bootstrap CI is wider than the fixed-threshold CI (often
1.5–2×) because it includes selection variance. Reporting the
narrower one when the deployment really refits is overconfident.

```python
from eval_toolkit import MaxF1Selector, paired_bootstrap_op_point_diff

# Model A's scores ignore the label; model B's scores shift upward for positives.
s_a = np.clip(rng.normal(0.5, 0.3, size=200), 0, 1)
s_b = np.clip(y * 0.5 + rng.normal(0.2, 0.3, size=200), 0, 1)

def threshold_fn(yt, ys):
    # Re-selects the max-F1 threshold on whatever slice it is handed.
    return MaxF1Selector().select(yt, ys).threshold

def f1_at(yt, ys, t):
    # metrics_at_threshold comes from the Setup imports.
    return float(metrics_at_threshold(yt, ys, t)["f1"])

# Two-level: refits the max-F1 threshold per resample on the val side,
# applies it on the test side, computes paired F1 difference.
# Demo shortcut: val and test reuse the same slice here; pass disjoint slices in practice.
diff = paired_bootstrap_op_point_diff(
    val_y=y, val_score_a=s_a, val_score_b=s_b,
    test_y=y, test_score_a=s_a, test_score_b=s_b,
    threshold_fn=threshold_fn, metric_fn=f1_at,
    n_resamples=200, seed=42,
)
print(f"Δ F1: {diff.delta:.3f}  CI [{diff.ci_low:.3f}, {diff.ci_high:.3f}]")
```
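
For intuition about what "fix once" versus "refit per resample" means mechanically, here is a single-model, standard-library sketch. It is not how `bootstrap_ci` or `paired_bootstrap_op_point_diff` are implemented; the function names, the percentile interval, the synthetic slices, and the `score >= threshold` rule are all assumptions.

```python
import random


def f1_at(y_true, scores, threshold):
    """F1 for the rule `score >= threshold` (assumed); 0.0 without true positives."""
    pairs = list(zip(y_true, scores))
    tp = sum(1 for label, score in pairs if score >= threshold and label == 1)
    fp = sum(1 for label, score in pairs if score >= threshold and label == 0)
    fn = sum(1 for label, score in pairs if score < threshold and label == 1)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def max_f1_threshold(y_true, scores):
    """Distinct score with the best F1 on the given slice."""
    return max(sorted(set(scores)), key=lambda t: f1_at(y_true, scores, t))


def resample(rng, y_true, scores):
    """Draw len(y_true) rows with replacement."""
    idx = [rng.randrange(len(y_true)) for _ in range(len(y_true))]
    return [y_true[i] for i in idx], [scores[i] for i in idx]


def bootstrap_f1_ci(val, test, *, refit, n_resamples=200, seed=0, alpha=0.05):
    """Percentile CI for test-side F1; refit=True reselects the threshold per resample."""
    rng = random.Random(seed)
    fixed = max_f1_threshold(*val)  # threshold chosen once on the full validation slice
    stats = []
    for _ in range(n_resamples):
        # Two-level: the validation slice is resampled too, so selection noise is included.
        t = max_f1_threshold(*resample(rng, *val)) if refit else fixed
        stats.append(f1_at(*resample(rng, *test), t))
    stats.sort()
    lo = stats[int((alpha / 2) * (n_resamples - 1))]
    hi = stats[int((1 - alpha / 2) * (n_resamples - 1))]
    return lo, hi


def make_slice(rng, n):
    """Synthetic slice loosely mirroring the chapter's fixture (30 % positives)."""
    labels = [1 if rng.random() < 0.3 else 0 for _ in range(n)]
    scores = [min(1.0, max(0.0, 0.6 * label + rng.gauss(0.0, 0.25))) for label in labels]
    return labels, scores


data_rng = random.Random(42)
val, test = make_slice(data_rng, 60), make_slice(data_rng, 60)

print("fixed threshold :", bootstrap_f1_ci(val, test, refit=False, seed=1))
print("refit per sample:", bootstrap_f1_ci(val, test, refit=True, seed=1))
```

The only difference between the two modes is the line that chooses `t`: the refit mode lets the threshold vary with the validation resample, which is where the extra interval width comes from.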

(thresholds-threshold-transfer)=
## Applying validation thresholds to other slices
For OOD or diagnostic slices, the common pattern is to fit the
threshold on a mixed-class validation slice and apply it elsewhere.
Use [`operating_points.py`](../api/operating_points.md)
for this instead of hand-rolling post-processing. Mixed-class targets
report `metrics_at_threshold`; all-positive targets report
`recall@threshold`; all-negative targets report `fpr@threshold` and
specificity.

See [evidence.md](evidence.md#threshold-transfer) for the full
claim-evidence framing.
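
The reporting rule above (which metrics are meaningful for which class mix) can be sketched as follows. This is an illustration only, not the `operating_points` API: `transfer_report`, its dictionary keys, and the `score >= threshold` rule are assumptions.

```python
def transfer_report(y_true, scores, threshold):
    """Apply a pre-fit threshold and report only metrics defined for the slice."""
    tp = fp = fn = tn = 0
    for label, score in zip(y_true, scores):
        predicted = score >= threshold
        if label == 1:
            tp, fn = tp + predicted, fn + (not predicted)
        else:
            fp, tn = fp + predicted, tn + (not predicted)
    has_pos, has_neg = (tp + fn) > 0, (fp + tn) > 0
    report = {}
    if has_pos:  # recall needs at least one positive label
        report["recall"] = tp / (tp + fn)
    if has_neg:  # FPR and specificity need at least one negative label
        report["fpr"] = fp / (fp + tn)
        report["specificity"] = tn / (fp + tn)
    if has_pos and has_neg:  # precision is only informative on mixed-class slices
        report["precision"] = tp / (tp + fp) if tp + fp else 0.0
    return report


threshold = 0.5  # assumed to have been fit on a mixed-class validation slice
print(transfer_report([1, 1, 1, 1], [0.2, 0.6, 0.7, 0.9], threshold))    # {'recall': 0.75}
print(transfer_report([0, 0, 0, 0], [0.1, 0.3, 0.55, 0.2], threshold))   # {'fpr': 0.25, 'specificity': 0.75}
```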

(thresholds-pitfalls)=
## Pitfalls / Common mistakes
- **Reporting metrics at threshold = 0.5 by default.** sklearn defaults
  to 0.5 for `predict()`. For an imbalanced classifier with a calibrated
  output, the Bayes-optimal threshold is rarely 0.5 — it's
  `(1−π) / (1+(c_fn / c_fp − 1)·π)`. Pick a selector deliberately.
- **Picking the threshold on the test set.** This is selection on the
  same data you'll evaluate on — straightforward leakage. Pick the
  threshold on a *validation* slice carved off the train fold.
- **Fitting `CostSensitiveSelector` on uncalibrated scores.** The
  Bayes-optimal formula assumes the scores are P(y=1 | x). Raw logits
  or pre-sigmoid model outputs aren't probabilities. Calibrate first.
- **Treating `MaxF1Selector` as universally best.** Max-F1 is
  appropriate when (a) F1 is the right metric and (b) costs are roughly
  symmetric. For asymmetric-cost deployments, use
  `CostSensitiveSelector` or `TargetFPRSelector`.
- **Reporting fixed-threshold CI when deployment refits.** Use the
  two-level bootstrap if production retunes the threshold on each new
  batch. The narrower fixed-threshold CI is overconfident.

- **`recall@p` semantics divergence on migration.** If you migrated
  from a code base where `recall@p` (or `"recall_0.90"` in the v0.6
  string API) picked the *smallest* threshold meeting the recall floor,
  i.e., the *highest-recall* feasible point, note that eval-toolkit's
  [`TargetRecallSelector(p)`](../api/thresholds.md)
  picks the *highest* threshold meeting the floor, i.e., the
  *most-precise* feasible point. Both are valid operating points; we
  standardize on the most-precise convention because it matches
  Lipton-Elkan 2014 §3 and the canonical "least-aggressive predictor
  that still meets the recall floor" framing. *Numerical effect*: both
  thresholds satisfy recall ≥ p, but the toolkit's threshold is ≥ the
  legacy threshold, so its recall is ≤ the legacy recall and its
  precision is typically ≥ the legacy precision. If
  you have a golden fixture pinned to the legacy convention, you'll
  see threshold differences but recall ≥ p still holds. See the
  `prompt-injection-sdd` v0.7.0 migration commit for a worked example
  (its `recall@0.90` golden case is now skipped with this rationale).
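
For reference, the inline expression in the first pitfall is the cost-sensitive closed form with numerator and denominator divided by $c_\text{fp}$; the steps below only rearrange the formula already given in this chapter:

$$
\begin{aligned}
t^* &= \frac{c_\text{fp}\,(1-\pi)}{c_\text{fp}\,(1-\pi) + c_\text{fn}\,\pi}
     = \frac{1-\pi}{(1-\pi) + \dfrac{c_\text{fn}}{c_\text{fp}}\,\pi} \\
    &= \frac{1-\pi}{1 + \left(\dfrac{c_\text{fn}}{c_\text{fp}} - 1\right)\pi}.
\end{aligned}
$$

With equal costs this reduces to $t^* = 1 - \pi$, which equals 0.5 only when $\pi = 0.5$.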

## Putting it all together

Compare four selectors side-by-side:

```python
from eval_toolkit import (
    MaxF1Selector, TargetRecallSelector, TargetPrecisionSelector,
    YoudenJSelector,
)

selectors = {
    "max_f1":          MaxF1Selector(),
    "recall_0.90":     TargetRecallSelector(0.90),
    "precision_0.50":  TargetPrecisionSelector(0.50),
    "youden_j":        YoudenJSelector(),
}
for name, sel in selectors.items():
    r = sel.select(y, s)
    print(f"  {name:18s}  thr={r.threshold:.3f}  P={r.precision:.3f}  "
          f"R={r.recall:.3f}  F1={r.f1:.3f}")
```

## Further reading

- Lipton, Z., Elkan, C., & Narayanaswamy, B. *Optimal thresholding of
  classifiers to maximize F1 measure.* ECML PKDD 2014.
  [arXiv:1402.1892](https://arxiv.org/abs/1402.1892).
- Elkan, C. *The foundations of cost-sensitive learning.* IJCAI 2001.
- Youden, W. J. *Index for rating diagnostic tests.* Cancer 3(1), 1950.
- sklearn docs:
  [`precision_recall_curve`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_curve.html),
  [`roc_curve`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html).

See also: [calibration.md](calibration.md) (calibrate before
cost-sensitive selection), [comparison.md](comparison.md) (single-level
vs two-level bootstrap).