# Stratified PR-AUC & the gap-flag report

> **Background** *(skip if you've internalized this)*. A binary
> classifier's headline PR-AUC can mask domain-specific failure: the
> model may exploit a *confounder* correlated with the label
> distribution rather than the underlying signal. Examples: text length
> (long inputs are more common for one class than the other),
> time-of-day (label distribution drifts), document source (one source
> is 90 % positive, another 10 %). A *stratified* PR-AUC, computed on
> the central window of the confounder, removes the tails where one
> class dominates and isolates whether the model still discriminates
> when the confounder is held roughly constant. McClish 1989 framed
> this as "partial AUC" for ROC curves; the same idea applies to PR
> space.

This chapter covers `quantile_stratified_pr_auc` (the trimmed
primitive) and `quantile_stratified_report` (the four-field reporting
wrapper that became canonical in the SDD `REPORT.md` convention).

## Setup

```python
import numpy as np
from eval_toolkit import (
    pr_auc, quantile_stratified_pr_auc, quantile_stratified_report,
)
```

A 500-row fixture where the model's signal is genuine but a
*confounder* (text length) is correlated with the label distribution:

```python
rng = np.random.default_rng(42)
n = 500
y = rng.binomial(1, 0.3, size=n)
# Length is correlated with label: positives tend to be longer.
length = rng.integers(20, 100, size=n) + (y * 80)
# Score has two components: (a) genuine signal, (b) length confound.
score_signal = y * 0.5 + rng.normal(0, 0.25, n)
score_confound = (length - length.mean()) / length.std() * 0.2
s = np.clip(score_signal + score_confound, 0, 1)
```

The full PR-AUC will look great because the model "sees" both the
signal and the confound. Stratifying on length removes the
confound-tails:

```python
full = pr_auc(y, s)
print(f"Full PR-AUC: {full:.3f}")

trimmed_block = quantile_stratified_pr_auc(y, s, length, q_low=0.25, q_high=0.75)
print(f"Trimmed PR-AUC (central 50 %): {trimmed_block['pr_auc']:.3f}")
print(f"Window: lengths in [{trimmed_block['stratifier_low']:.0f}, "
      f"{trimmed_block['stratifier_high']:.0f}]")
```

If the trimmed metric is much lower than the full metric, the model
was riding the confound. The four-field report makes this auditable in
one shot:

```python
report = quantile_stratified_report(y, s, length, gap_threshold=0.05)
print(report)
```

(gap-flag)=
## The gap-flag convention
`quantile_stratified_report` returns:

```text
{"full": ..., "trimmed": ..., "gap": full - trimmed, "gap_flag": gap > threshold}
```

- **`full`** — PR-AUC over all rows. The headline metric.
- **`trimmed`** — PR-AUC over the central `[q_low, q_high]` quantile
  window of the stratifier. Default `[0.25, 0.75]` keeps the middle 50 %.
- **`gap`** — `full - trimmed`. Positive ⇒ the tails inflate the
  headline. Negative ⇒ the tails are *harder* than the central window
  (rarer; usually a sample-size artifact when one tail is small).
- **`gap_flag`** — `gap > gap_threshold` (default 0.05). A single bit
  designed for at-a-glance reviewer-friendly tables.

The 0.05 threshold is the SDD reporting convention; tune for your
domain. A gap of 0.02 is usually noise; 0.10+ is almost always real
confound exploitation.

(length_stratification-when)=
## When to use
Use `quantile_stratified_report` whenever you suspect the score
correlates with a continuous covariate that's *also* correlated with
the label. Common stratifiers:

- **Text length** — for prompt-injection / safety eval, length is
  notoriously confounded with the label distribution (long jailbreak
  prompts, short benign queries). The gap reveals whether the model
  learned the attack pattern or just "long text → flag it".
- **Time** — for any drift-prone task (recommender systems, fraud
  detection), the gap on a temporal stratifier shows whether the model
  generalizes across time periods.
- **Source** — when training pools mix sources with different label
  priors (Lakera + LLMail + OASST in PI eval), the source-stratified
  gap indicates whether the model is per-source-overfit.
- **The score itself** — passing the score as the stratifier is
  equivalent to McClish 1989's partial-AUC over a score-quantile
  window. Useful when you only care about the operating range
  `[q_low, q_high]` of scores (e.g., "the part of the PR curve where
  precision is between 0.7 and 0.9").

(when-not)=
## When NOT to use
- **The stratifier is the label** — degenerate; you'd be measuring
  PR-AUC on a single class.
- **The stratifier is unrelated to the label** — the gap will be
  noise. Verify upstream (e.g., `np.corrcoef(label, stratifier)`)
  before reading the gap as evidence of confound exploitation.
- **n is too small in the central window** — the toolkit raises
  `ValueError` if either class has fewer than 10 rows in the trimmed
  subset. A failing gap report on small slices is usually just
  insufficient power.

## Putting it all together

A complete length-stratified audit:

```python
from eval_toolkit import expected_calibration_error_l2_debiased

print("=" * 60)
print(f"Headline PR-AUC: {pr_auc(y, s):.3f}")
print(f"Headline ECE:    {expected_calibration_error_l2_debiased(y, s):.4f}")
print("-" * 60)
report = quantile_stratified_report(y, s, length, gap_threshold=0.05)
print(f"Length-stratified report:")
for k, v in report.items():
    print(f"  {k}: {v}")
if report["gap_flag"]:
    print("⚠  GAP > 0.05: model may exploit length confound.")
```

If `gap_flag` fires, the right next step is *not* "drop the model"
— it's "investigate the confound". Maybe length actually IS a
legitimate signal in your domain (longer documents really are more
likely to be one class). The gap-flag is a *prompt for inspection*,
not a verdict.

(length_stratification-pitfalls)=
## Pitfalls / Common mistakes
- **Reporting `full` without `trimmed`.** A single PR-AUC number is
  uninspectable. Always pair with the stratified version when a
  plausible confounder exists.
- **Picking q_low / q_high arbitrarily.** The default `[0.25, 0.75]`
  is the SDD convention but may be wrong for your distribution. If
  the stratifier is bimodal (e.g., very-short and very-long with
  little in between), the central 50 % window is mostly noise; use a
  narrower window or a different stratifier.
- **Treating the gap as a calibrated probability.** It's a
  diff-of-PR-AUCs, bounded in `[-1, 1]` but not a fraction of any
  meaningful quantity. Don't say "the model is 8 % more confounded".
- **Reporting gap without CIs.** Both `full` and `trimmed` have
  bootstrap variance; the gap inherits both. For statistically-rigorous
  reporting, bootstrap each, then bootstrap the difference. (See
  [comparison.md](comparison.md).)
- **Stacking many stratifications.** If you run gap-reports on length,
  source, time, and 5 other covariates, you're multiple-testing
  yourself into noise. Pick the 1–2 stratifiers your domain knowledge
  flags as plausible.

## Further reading

- McClish, D. K. *Analyzing a portion of the ROC curve.* Medical
  Decision Making 9(3), 1989. — partial-AUC framework, the conceptual
  ancestor of stratified PR-AUC.
- Saito, T. & Rehmsmeier, M. *The precision-recall plot is more
  informative than the ROC plot when evaluating binary classifiers
  on imbalanced datasets.* PLOS ONE 10(3), 2015. — why PR-AUC over
  ROC-AUC on imbalanced data.
- Recht, B. et al. *Do ImageNet classifiers generalize to ImageNet?*
  ICML 2019. — empirical case study of distribution-shift gaps that
  the headline metric hides.

See also: [calibration.md](calibration.md) (decomposing a metric into
reliability + resolution + uncertainty),
[comparison.md](comparison.md) (bootstrap CIs on differences).