# Bootstrap

> **Background** *(skip if you've internalized this)*. The bootstrap
> turns a point estimate into a confidence interval by *resampling*
> rows from the eval set with replacement, recomputing the metric, and
> reading quantiles off the resampled distribution. It works for any
> metric, makes no parametric assumption, and gives honest CIs that
> reflect finite-sample variance. The cost: O(n × n_resamples) compute
> per CI. Modern toolkits use 1 000–10 000 resamples; eval-toolkit
> defaults to 1 000.

This chapter covers the four primitives in
`eval_toolkit.bootstrap`:
[`bootstrap_ci`](../api/bootstrap.md) (single
condition), [`paired_bootstrap_diff`](../api/bootstrap.md)
(two-condition difference), [`paired_bootstrap_op_point_diff`](../api/bootstrap.md)
(two-level: refit threshold per resample), and
[`cv_clt_ci`](../api/bootstrap.md) (K-fold CV-CI). For
a higher-level overview of model comparison see
[comparison.md](comparison.md); this chapter goes deeper on the
resampling theory.

## Setup

```python
import numpy as np
from eval_toolkit import (
    bootstrap_ci, paired_bootstrap_diff, mde_from_ci, cv_clt_ci, pr_auc,
)
```

A 200-row fixture used throughout:

```python
rng = np.random.default_rng(42)
y = rng.binomial(1, 0.3, size=200).astype(int)
s_a = rng.uniform(0, 1, size=200)
s_b = np.clip(0.6 * y + rng.normal(0, 0.25, size=200), 0, 1)
```

(bootstrap-bca-vs-percentile)=
## BCa vs percentile
Three CI-construction methods ship in `bootstrap_ci`:

- **`method="BCa"`** (default) — Efron 1987: bias-corrected,
  accelerated. Adjusts quantile cuts for skew via a jackknife.
  Most accurate for moderate samples (n ≥ 30) and the
  recommended default in Efron & Tibshirani 1993.
- **`method="percentile"`** — naive `(α/2, 1-α/2)` quantile cut.
  Slightly optimistic on skewed distributions but the recommended
  fallback when BCa's jackknife is degenerate (very small n,
  constant scores).
- **`method="studentized"`** — bootstrap-t. Rare in ML eval; ships for
  completeness.

```python
ci_bca = bootstrap_ci(y, s_b, pr_auc, n_resamples=500, method="BCa", seed=42)
ci_pct = bootstrap_ci(y, s_b, pr_auc, n_resamples=500, method="percentile", seed=42)
print(f"BCa:        {ci_bca.point_estimate:.3f}  CI [{ci_bca.ci_low:.3f}, {ci_bca.ci_high:.3f}]")
print(f"Percentile: {ci_pct.point_estimate:.3f}  CI [{ci_pct.ci_low:.3f}, {ci_pct.ci_high:.3f}]")
```

The point estimates match (both methods evaluate the metric on the
original data), but BCa shifts the bounds asymmetrically when the
resampled distribution is skewed. For roughly symmetric metrics on
moderate n the difference is < 1 %; for highly skewed cases (small n,
rare positives) BCa's correction matters.
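
To make the difference concrete, here is a minimal standard-library sketch of both constructions. It is an illustration of the textbook formulas (Efron 1987), not the code inside `bootstrap_ci`; the helper names `boot_stats`, `pick`, `percentile_ci`, and `bca_ci` are hypothetical, and the skewed toy sample stands in for a real metric.

```python
import random
from statistics import NormalDist, fmean

def boot_stats(data, stat, n_resamples, seed):
    """Sorted values of `stat` over resamples drawn with replacement."""
    rng = random.Random(seed)
    return sorted(stat(rng.choices(data, k=len(data))) for _ in range(n_resamples))

def pick(sorted_vals, q):
    """Nearest-rank empirical quantile of an already sorted list."""
    return sorted_vals[round(q * (len(sorted_vals) - 1))]

def percentile_ci(data, stat, n_resamples=1000, alpha=0.05, seed=0):
    boot = boot_stats(data, stat, n_resamples, seed)
    return pick(boot, alpha / 2), pick(boot, 1 - alpha / 2)

def bca_ci(data, stat, n_resamples=1000, alpha=0.05, seed=0):
    norm = NormalDist()
    theta_hat = stat(data)
    boot = boot_stats(data, stat, n_resamples, seed)
    b = len(boot)
    # Bias correction z0: share of resampled statistics below the point estimate,
    # clamped away from 0 and 1 so the inverse normal CDF stays finite.
    below = sum(v < theta_hat for v in boot)
    z0 = norm.inv_cdf(min(max(below, 1), b - 1) / b)
    # Acceleration a: skewness of the leave-one-out (jackknife) estimates.
    jack = [stat(data[:i] + data[i + 1:]) for i in range(len(data))]
    centre = fmean(jack)
    den = 6 * sum((centre - j) ** 2 for j in jack) ** 1.5
    if den == 0:
        raise ValueError("degenerate jackknife; use the percentile method instead")
    a = sum((centre - j) ** 3 for j in jack) / den

    def adjusted(q):
        # Map a nominal quantile level to the bias- and skew-corrected level.
        z = z0 + norm.inv_cdf(q)
        return norm.cdf(z0 + z / (1 - a * z))

    return pick(boot, adjusted(alpha / 2)), pick(boot, adjusted(1 - alpha / 2))

# Right-skewed toy sample; the statistic is the plain mean.
data = [0.1] * 30 + [0.5] * 8 + [3.0] * 2
theta_hat = fmean(data)
pct = percentile_ci(data, fmean)
bca = bca_ci(data, fmean)
print(f"point estimate: {theta_hat:.3f}")
print(f"percentile CI:  [{pct[0]:.3f}, {pct[1]:.3f}]")
print(f"BCa CI:         [{bca[0]:.3f}, {bca[1]:.3f}]")
```

Both functions read quantiles from the same resampled statistics; BCa only changes *which* quantile levels are read. The `ValueError` branch corresponds to the degenerate-jackknife case (for example, constant data) where the percentile method is the fallback.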

(bootstrap-paired-bootstrap)=
## Paired bootstrap for two-model comparison
When comparing two models on the *same* eval rows, the metric
difference Δ usually has *less variance* than a difference of two
independently resampled metrics, because shared resampling noise
cancels: rows that are hard for A tend to be hard for B too. Paired
bootstrap exploits this by sharing resample indices.

```python
diff = paired_bootstrap_diff(y, s_a, s_b, pr_auc, n_resamples=500, seed=42)
print(f"Δ PR-AUC: {diff.delta:.3f}  CI [{diff.ci_low:.3f}, {diff.ci_high:.3f}]")
print(f"  overlaps zero: {diff.overlaps_zero}")
```

Compare with computing two separate CIs and eyeballing them:

```python
ci_a = bootstrap_ci(y, s_a, pr_auc, n_resamples=500, seed=42)
ci_b = bootstrap_ci(y, s_b, pr_auc, n_resamples=500, seed=42)
print(f"A CI: [{ci_a.ci_low:.3f}, {ci_a.ci_high:.3f}]")
print(f"B CI: [{ci_b.ci_low:.3f}, {ci_b.ci_high:.3f}]")
print(f"unpaired-CI width sum: {(ci_a.ci_high - ci_a.ci_low) + (ci_b.ci_high - ci_b.ci_low):.3f}")
print(f"paired-Δ width:        {diff.ci_high - diff.ci_low:.3f}")
```

The paired CI is reliably tighter — typically 30–50 % narrower for
informative score pairs.
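
The effect of sharing indices can be seen in a small standard-library sketch. The data are hypothetical per-row correctness flags (so the metric is accuracy rather than PR-AUC), and the loop is an illustration of the idea, not the implementation of `paired_bootstrap_diff`.

```python
import random
from statistics import fmean

def quantile_ci(values, alpha=0.05):
    """Percentile interval from a list of resampled statistics."""
    s = sorted(values)
    return s[round(alpha / 2 * (len(s) - 1))], s[round((1 - alpha / 2) * (len(s) - 1))]

# Hypothetical correctness flags (1 = correct) for two models on the same 200 rows.
# B agrees with A everywhere except 10 rows that only B gets right.
correct_a = [1] * 120 + [0] * 80
correct_b = [1] * 130 + [0] * 70
n = len(correct_a)

rng = random.Random(42)
acc_a, acc_b, paired, unpaired = [], [], [], []
for _ in range(1000):
    idx = [rng.randrange(n) for _ in range(n)]        # indices shared by A and B
    idx_b = [rng.randrange(n) for _ in range(n)]      # independent draw, for contrast
    a = fmean(correct_a[i] for i in idx)
    b = fmean(correct_b[i] for i in idx)
    acc_a.append(a)
    acc_b.append(b)
    paired.append(b - a)
    unpaired.append(fmean(correct_b[i] for i in idx_b) - a)

a_lo, a_hi = quantile_ci(acc_a)
b_lo, b_hi = quantile_ci(acc_b)
p_lo, p_hi = quantile_ci(paired)
u_lo, u_hi = quantile_ci(unpaired)
print(f"A CI: [{a_lo:.3f}, {a_hi:.3f}]   B CI: [{b_lo:.3f}, {b_hi:.3f}]")
print(f"paired Δ CI:   [{p_lo:.3f}, {p_hi:.3f}]")
print(f"unpaired Δ CI: [{u_lo:.3f}, {u_hi:.3f}]")
```

In this constructed example the two single-model intervals overlap, yet the paired interval on Δ excludes zero: the rows where the models agree contribute nothing to the paired difference, so only the 10 disagreeing rows drive its variance.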

(two-level)=
## Two-level paired bootstrap (refit per resample)
When the metric is *operating-point-dependent* (F1, precision-at-recall)
and the deployment refits the threshold per data batch, the
single-level bootstrap underestimates uncertainty. The two-level
variant ([`paired_bootstrap_op_point_diff`](../api/bootstrap.md))
resamples a *validation* slice, refits the threshold on that
resample, then computes the metric on the *test* resample.

The CI captures both metric variance AND threshold-selection variance.
Empirically 1.5–2× wider than single-level. Reporting the narrower
single-level CI when the deployment refits is overconfident — see
[thresholds.md §"When to refit threshold per resample"](thresholds.md#bootstrap-refit)
for the operational decision.
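
The resampling loop described above can be sketched with the standard library alone. This is a simplified single-model version under stated assumptions (F1 as the metric, a fixed threshold grid, synthetic scores shaped like the `s_b` fixture); `paired_bootstrap_op_point_diff` itself works on a two-model difference, and all helper names here are hypothetical.

```python
import random

def f1_at(y, s, thr):
    """F1 of the rule `score >= thr` against binary labels."""
    tp = fp = fn = 0
    for yi, si in zip(y, s):
        pred = si >= thr
        if pred and yi:
            tp += 1
        elif pred:
            fp += 1
        elif yi:
            fn += 1
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

GRID = [i / 20 for i in range(1, 20)]  # candidate thresholds 0.05 .. 0.95

def best_threshold(y, s):
    return max(GRID, key=lambda t: f1_at(y, s, t))

def resample(rng, y, s):
    idx = [rng.randrange(len(y)) for _ in range(len(y))]
    return [y[i] for i in idx], [s[i] for i in idx]

def f1_ci(y_val, s_val, y_test, s_test, refit, n_resamples=200, seed=0):
    rng = random.Random(seed)
    thr = best_threshold(y_val, s_val)          # single-level: chosen once
    stats = []
    for _ in range(n_resamples):
        if refit:                               # two-level: refit on a validation resample
            thr = best_threshold(*resample(rng, y_val, s_val))
        stats.append(f1_at(*resample(rng, y_test, s_test), thr))
    stats.sort()
    return stats[round(0.025 * (len(stats) - 1))], stats[round(0.975 * (len(stats) - 1))]

def make_split(rng, n):
    """Synthetic labels and scores, loosely mirroring the chapter's s_b fixture."""
    y = [1 if rng.random() < 0.3 else 0 for _ in range(n)]
    s = [min(1.0, max(0.0, 0.6 * yi + rng.gauss(0, 0.25))) for yi in y]
    return y, s

data_rng = random.Random(42)
y_val, s_val = make_split(data_rng, 100)
y_test, s_test = make_split(data_rng, 100)

single = f1_ci(y_val, s_val, y_test, s_test, refit=False)
two_level = f1_ci(y_val, s_val, y_test, s_test, refit=True)
print(f"single-level F1 CI: [{single[0]:.3f}, {single[1]:.3f}]")
print(f"two-level F1 CI:    [{two_level[0]:.3f}, {two_level[1]:.3f}]")
```

The only difference between the two calls is whether the threshold is refit inside the loop, which is what lets threshold-selection variance enter the interval.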

(bootstrap-mde)=
## Minimum detectable effect (MDE)
A wide CI overlapping zero isn't evidence the two models perform the
same — it's evidence you don't have power to tell. Quantify with MDE:
the smallest true difference your bootstrap procedure would detect at
80 % power.

```python
mde = mde_from_ci(diff, alpha=0.05, power=0.80)
print(f"MDE @ 80 % power: {mde.mde:.4f}")
```

If your MDE is 0.03 and the difference you care about is 0.01, you
need more eval data — running a different statistical test won't
help.
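
For intuition about where such a number comes from, a common normal-approximation recipe backs a standard error out of the CI width and scales it by the two critical values. The sketch below assumes that recipe and a CI built at level 1 − α; it is not necessarily how `mde_from_ci` computes its result, and `mde_normal_approx` is a hypothetical name.

```python
from statistics import NormalDist

def mde_normal_approx(ci_low, ci_high, alpha=0.05, power=0.80):
    """MDE from a two-sided (1 - alpha) CI, assuming a roughly normal estimator."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)       # two-sided critical value
    z_power = norm.inv_cdf(power)               # quantile matching the target power
    se = (ci_high - ci_low) / (2 * z_alpha)     # standard error implied by the CI width
    return (z_alpha + z_power) * se

mde_example = mde_normal_approx(-0.02, 0.06)
print(f"MDE @ 80 % power: {mde_example:.4f}")
```

With a 95 % CI of [-0.02, 0.06] this gives an MDE of roughly 0.057, about 0.7 times the CI width. Under this approximation, halving the MDE requires about four times as many eval rows.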

(bootstrap-cv-ci)=
## CV-CI: K-fold confidence interval
[`cv_clt_ci`](../api/bootstrap.md) computes a
CLT-corrected confidence interval over per-fold metric values. Per-fold
metrics are *not* independent — they share training data, so naive
Student's-t CIs over fold metrics are anti-conservative. The
correction (Bates et al. 2024) accounts for this.

```python
fold_metrics = np.array([0.74, 0.76, 0.71, 0.78, 0.73])
ci_cv = cv_clt_ci(fold_metrics, confidence=0.95)
print(f"CV mean: {ci_cv.point_estimate:.3f}  CI [{ci_cv.ci_low:.3f}, {ci_cv.ci_high:.3f}]")
```
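
For contrast, this is the naive Student's-t interval over the same fold metrics that the text warns against. It is a standard-library sketch of the baseline only, not of the correction in `cv_clt_ci`; the critical value is hard-coded for 5 folds (4 degrees of freedom) because the standard library has no t distribution.

```python
from math import sqrt
from statistics import fmean, stdev

fold_metrics = [0.74, 0.76, 0.71, 0.78, 0.73]
k = len(fold_metrics)

# Two-sided 95 % critical value of Student's t with k - 1 = 4 degrees of freedom.
T_975_DF4 = 2.776

center = fmean(fold_metrics)
half_width = T_975_DF4 * stdev(fold_metrics) / sqrt(k)   # treats folds as independent
naive_low, naive_high = center - half_width, center + half_width
print(f"naive t CI: {center:.3f}  [{naive_low:.3f}, {naive_high:.3f}]")
```

Because this interval treats the folds as independent, a corrected interval would be expected to be wider.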

`evaluate_folded(...)` auto-computes this for every (slice, scorer,
metric) triple and stores it in `RunResult.fold_summary` — see
[splits.md §"K-fold cross-validation"](splits.md#stratified-kfold).

(budget)=
## Resample budget guidance
| n_resamples | Use case |
|---|---|
| 200 | Quick sanity check during development; CIs accurate to ~3 % |
| 1 000 | **Default**; CIs accurate to ~1 % at the 95 % level |
| 5 000 | Publication-grade; use when reporting lift to a paper or external stakeholder |
| 10 000+ | Diminishing returns; only if the metric is cheap to compute (not an LLM-judge) and you want maximum CI precision |

The resampling cost is O(n × n_resamples × metric_cost). For
LLM-judge scorers where the metric implicitly requires LLM calls, this
explodes — pre-compute scores once, then resample on the score arrays
(the toolkit's pattern: caller produces `(y_true, y_score)` arrays
externally and feeds them in).
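
The pre-compute pattern can be sketched as follows. `expensive_judge` is a hypothetical stand-in that counts its own invocations instead of calling a model, and the bootstrap is a plain percentile interval on the mean score.

```python
import random
from statistics import fmean

calls = 0

def expensive_judge(example):
    """Stand-in for an LLM-judge call; counts invocations instead of calling a model."""
    global calls
    calls += 1
    return 1.0 if len(example) % 2 == 0 else 0.0

examples = [f"response-{i}" for i in range(50)]

# Score every row exactly once, up front.
scores = [expensive_judge(ex) for ex in examples]

# Resample only the cached scores; the judge is never called inside the loop.
rng = random.Random(0)
boot = sorted(fmean(rng.choices(scores, k=len(scores))) for _ in range(1000))
ci_low, ci_high = boot[round(0.025 * 999)], boot[round(0.975 * 999)]

print(f"judge calls: {calls}")
print(f"mean score: {fmean(scores):.3f}  CI [{ci_low:.3f}, {ci_high:.3f}]")
```

The judge runs once per row (50 calls) regardless of `n_resamples`; scoring inside the loop would cost 50 × 1 000 calls for the same interval.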

(bootstrap-pitfalls)=
## Pitfalls / Common mistakes
- **Treating overlapping CIs as non-significance.** Non-overlap implies
  significance, but overlap does NOT imply non-significance. Always
  compute the paired CI on the difference, not two separate CIs.
- **Reporting bootstrap mean instead of point estimate.**
  `BootstrapCI.point_estimate` is the metric on the *original* data,
  not the resample mean. Report `point_estimate ± CI`, not
  `resample_mean ± CI`.
- **Bootstrapping accuracy on n < 30.** BCa's jackknife is degenerate.
  The toolkit emits an error in that regime; fall back to
  `method="percentile"` and document the choice.
- **Not seeding.** `bootstrap_ci(..., seed=42)` makes runs reproducible.
  An unseeded run gives a slightly different CI on every invocation,
  which breaks golden tests and makes continuous-integration runs flaky.
- **Comparing ECE bootstrap CIs across runs with different `n_bins`.**
  ECE depends on bin count; bootstrap inherits that sensitivity. Pin
  `n_bins` in your project config.
- **Bootstrapping on the train set.** The bootstrap quantifies sample
  variance of *your eval set*, not generalization to a new population.
  For OOD claims you still need a held-out test set — see
  [splits.md §"When CV alone is insufficient"](splits.md#cv-and-ood).
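
The `n_bins` pitfall is easy to reproduce. The sketch below uses a plain equal-width-bin ECE for binary labels (an assumed, common definition; the toolkit's own ECE may differ in details) on a tiny constructed dataset.

```python
def ece(y_true, y_prob, n_bins):
    """Equal-width-bin ECE: weighted |positive rate - mean score| per bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)   # p == 1.0 falls in the last bin
        bins[idx].append((y, p))
    total = 0.0
    for b in bins:
        if b:
            pos_rate = sum(y for y, _ in b) / len(b)
            mean_prob = sum(p for _, p in b) / len(b)
            total += len(b) / len(y_true) * abs(pos_rate - mean_prob)
    return total

# Low scores are mostly positive and high scores mostly negative,
# but the two errors cancel when everything lands in one bin.
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
y_prob = [0.25] * 4 + [0.75] * 4

print(ece(y_true, y_prob, n_bins=1))   # 0.0
print(ece(y_true, y_prob, n_bins=2))   # 0.5
```

The same predictions score 0.0 or 0.5 depending only on the bin count, so bootstrap CIs computed under different `n_bins` are not comparable.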

## Further reading

- Efron, B. & Tibshirani, R. *An Introduction to the Bootstrap.*
  Chapman & Hall, 1993. **The canonical reference.** §14 derives BCa.
- DiCiccio, T. & Efron, B. *Bootstrap confidence intervals.*
  Statistical Science 11(3), 1996. — comparison of CI methods + when
  each fails.
- Bates, S., Hastie, T., & Tibshirani, R. *Cross-validation: what
  does it estimate and how well does it do it?* JASA 2024. — basis
  for `cv_clt_ci`'s CLT correction.
- Davison, A. C. & Hinkley, D. V. *Bootstrap Methods and their
  Application.* Cambridge, 1997. — alternate canonical text;
  good complement to Efron & Tibshirani.

See also: [comparison.md](comparison.md) (higher-level model
comparison framing), [splits.md](splits.md) (K-fold context for
`cv_clt_ci`), [calibration.md](calibration.md) (paired-ECE-difference
specifics).