---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
kernelspec:
  display_name: Python 3
  language: python
  name: python3
---

# Worked example: metrics + bootstrap CIs

> **What this shows.** Compute `pr_auc` / `roc_auc` / `brier_score` on a
> synthetic binary-classification fixture, then attach a 95% bootstrap CI
> via `bootstrap_ci`. The minimal entry point into the toolkit.
>
> **Runtime:** ~1 s on a laptop. Pure numpy/scipy/sklearn core — no
> optional dependencies.

## Setup

```{code-cell}
import numpy as np
from eval_toolkit import (
    pr_auc, roc_auc, brier_score,
    bootstrap_ci, set_global_seeds,
)
set_global_seeds(42)
```

## Synthetic data: 200-row balanced binary classifier

Toy ground-truth labels plus scores from a discriminative-but-noisy
model. Class means sit at 0.35 (negatives) and 0.65 (positives), a
separation of `0.3`, with Gaussian noise (σ = 0.2) added and the result
clipped to `[0, 1]`.

```{code-cell}
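rng = np.random.default_rng(42)
n = 200
# Balanced labels: n/2 negatives and n/2 positives, in random order.
y_true = np.concatenate([np.zeros(n // 2), np.ones(n // 2)]).astype(int)
rng.shuffle(y_true)
# Class means 0.35 / 0.65, Gaussian noise with sigma = 0.2, clipped to [0, 1].
y_score = np.clip(
    0.5 + 0.3 * (y_true - 0.5) + rng.normal(0, 0.2, size=n),
    0.0, 1.0,
)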
```

## Point estimates: pr_auc, roc_auc, brier_score

Each is a single function call returning a float:

```{code-cell}
ap = pr_auc(y_true, y_score)
auc = roc_auc(y_true, y_score)
bs = brier_score(y_true, y_score)
assert 0.0 <= ap <= 1.0
assert 0.0 <= auc <= 1.0
assert 0.0 <= bs <= 1.0
print(f"pr_auc={ap:.3f}  roc_auc={auc:.3f}  brier={bs:.3f}")
```

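With class means 0.3 apart and σ = 0.2, expect roughly 0.85 ROC-AUC and a
similar PR-AUC. The Brier score should land near 0.16: each score sits on
average 0.35 from its label with noise variance 0.04, so the expected
squared error is about 0.35² + 0.2² ≈ 0.16 before clipping. Brier reflects
both calibration and sharpness, so these deliberately compressed scores do
not approach zero.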
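
As a rough sanity check on those magnitudes, the sketch below recomputes the three metrics with plain NumPy on an equivalent fixture and compares ROC-AUC and Brier against closed-form approximations for two Gaussians. The helper names (`roc_auc_np`, `pr_auc_np`) are hypothetical, the seed differs from the notebook cells, and nothing here is claimed to mirror how `eval_toolkit` implements its metrics. Tie handling in particular is simplified.

```python
from statistics import NormalDist

import numpy as np

rng = np.random.default_rng(0)
n = 200
y_true = np.repeat([0, 1], n // 2)
y_score = np.clip(0.5 + 0.3 * (y_true - 0.5) + rng.normal(0, 0.2, size=n), 0.0, 1.0)


def roc_auc_np(y, s):
    """P(score of a positive > score of a negative), ties counted as 1/2."""
    pos, neg = s[y == 1], s[y == 0]
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())


def pr_auc_np(y, s):
    """Average precision: mean precision at the rank of each positive.

    Simplification: tied scores are ranked in stable-sort order.
    """
    order = np.argsort(-s, kind="stable")
    hits = y[order] == 1
    precision_at_k = np.cumsum(hits) / np.arange(1, len(y) + 1)
    return float(precision_at_k[hits].mean())


auc = roc_auc_np(y_true, y_score)
ap = pr_auc_np(y_true, y_score)
brier = float(np.mean((y_score - y_true) ** 2))

# Closed-form approximations, ignoring the clip to [0, 1]:
#   AUC   ~ Phi(delta / (sigma * sqrt(2)))  with delta = 0.3, sigma = 0.2
#   Brier ~ 0.35**2 + sigma**2              (squared mean offset plus noise variance)
auc_expected = NormalDist().cdf(0.3 / (0.2 * 2 ** 0.5))
brier_expected = 0.35 ** 2 + 0.2 ** 2

print(f"auc={auc:.3f} (approx {auc_expected:.3f})")
print(f"ap={ap:.3f}")
print(f"brier={brier:.3f} (approx {brier_expected:.3f})")
```

With only 200 rows the sampled values will wander around the approximations by a few hundredths, which is the variability the bootstrap interval is meant to quantify.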

## 95% bootstrap CI

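`bootstrap_ci` wraps `scipy.stats.bootstrap` with BCa as the default
method and returns a `BootstrapCI` dataclass exposing `point_estimate`,
`ci_low`, and `ci_high`, plus the `confidence` level and `method` used.
`n_resamples=200` keeps this example fast; SciPy's own default is 9999: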

```{code-cell}
ci_ap = bootstrap_ci(y_true, y_score, metric=pr_auc, n_resamples=200, seed=42)
print(f"pr_auc = {ci_ap.point_estimate:.3f}  [95% CI: {ci_ap.ci_low:.3f}, {ci_ap.ci_high:.3f}]")
assert ci_ap.ci_low <= ci_ap.point_estimate <= ci_ap.ci_high
assert ci_ap.confidence == 0.95
assert ci_ap.method == "BCa"
```
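
For orientation, a direct `scipy.stats.bootstrap` call on paired labels and scores might look roughly like the sketch below. How `bootstrap_ci` actually forwards its arguments is an assumption, and `auc_stat` is a hypothetical stand-in metric written in plain NumPy so the block runs without the toolkit.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200
y_true = np.repeat([0, 1], n // 2)
y_score = np.clip(0.5 + 0.3 * (y_true - 0.5) + rng.normal(0, 0.2, size=n), 0.0, 1.0)


def auc_stat(y, s):
    """Hypothetical stand-in metric: pairwise ROC-AUC, ties counted as 1/2."""
    pos, neg = s[y == 1], s[y == 0]
    diff = pos[:, None] - neg[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()


res = stats.bootstrap(
    (y_true, y_score),
    auc_stat,
    paired=True,        # resample rows, keeping each label with its score
    vectorized=False,   # auc_stat handles one resample at a time
    n_resamples=200,
    confidence_level=0.95,
    method="BCa",
    random_state=42,
)
ci = res.confidence_interval
print(f"auc = {auc_stat(y_true, y_score):.3f}  [95% CI: {ci.low:.3f}, {ci.high:.3f}]")
```

`paired=True` is the important switch here: without it SciPy would resample labels and scores independently and destroy the association the metric measures.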

## When to use percentile instead of BCa

BCa is the default. Fall back to `method="percentile"` for very small
samples where BCa's jackknife step can degenerate:

```{code-cell}
ci_perc = bootstrap_ci(
    y_true, y_score, metric=pr_auc,
    n_resamples=200, method="percentile", seed=42,
)
print(f"pr_auc (percentile) = [{ci_perc.ci_low:.3f}, {ci_perc.ci_high:.3f}]")
```

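The percentile CI reads the 2.5th and 97.5th percentiles straight off the
bootstrap distribution. BCa shifts those percentile levels using a
bias-correction term (`z0-hat`) and an acceleration term (`a-hat`), so the
two intervals diverge when the bootstrap distribution is biased or skewed.
For most well-conditioned cases the two methods agree to within ~0.01.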
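
To make the difference concrete, here is a minimal hand-rolled sketch of both intervals for the Brier score, using the textbook BCa formulas. It is illustrative only: SciPy's implementation may differ in details, the fixture seed differs from the notebook cells, and the helper `adjusted_level` is a hypothetical name. Brier is convenient because it is a mean of per-row squared errors, so resampling rows is the same as resampling those errors.

```python
from statistics import NormalDist

import numpy as np

rng = np.random.default_rng(0)
n = 200
y_true = np.repeat([0, 1], n // 2)
y_score = np.clip(0.5 + 0.3 * (y_true - 0.5) + rng.normal(0, 0.2, size=n), 0.0, 1.0)

loss = (y_score - y_true) ** 2      # per-row squared error
theta_hat = loss.mean()             # Brier point estimate

# Bootstrap distribution: B resamples of n rows drawn with replacement.
B = 2000
boot = loss[rng.integers(0, n, size=(B, n))].mean(axis=1)

alpha = 0.05
norm = NormalDist()

# Percentile interval: plain quantiles of the bootstrap distribution.
pct_low, pct_high = np.quantile(boot, [alpha / 2, 1 - alpha / 2])

# Bias correction z0: how far theta_hat sits from the bootstrap median.
z0 = norm.inv_cdf(float(np.mean(boot < theta_hat)))

# Acceleration a_hat from the jackknife (leave-one-out) estimates.
jack = (loss.sum() - loss) / (n - 1)
d = jack.mean() - jack
a_hat = float((d ** 3).sum() / (6.0 * (d ** 2).sum() ** 1.5))


def adjusted_level(q):
    """Map a nominal quantile level q to its BCa-adjusted level."""
    z = norm.inv_cdf(q)
    return norm.cdf(z0 + (z0 + z) / (1.0 - a_hat * (z0 + z)))


bca_low, bca_high = np.quantile(
    boot, [adjusted_level(alpha / 2), adjusted_level(1 - alpha / 2)]
)

print(f"brier      = {theta_hat:.4f}")
print(f"percentile = [{pct_low:.4f}, {pct_high:.4f}]")
print(f"BCa        = [{bca_low:.4f}, {bca_high:.4f}]  (z0={z0:.3f}, a_hat={a_hat:.4f})")
```

The jackknife step is also where small samples cause trouble: if every leave-one-out estimate is identical, the denominator of `a_hat` is zero and the adjustment is undefined.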

## Pre-1.0 design note

`bootstrap_ci` rejects `n < 10` with `ValueError` (too few samples for
meaningful bootstrap variance). NaN/Inf scores are rejected by the
underlying metrics — see
[NaN/Inf rejection tests](https://github.com/brandon-behring/eval-toolkit/blob/main/tests/test_metrics_props.py)
for the full input-validation contract.
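
The two rules above can be pictured as a small guard function. The sketch below is hypothetical: `check_inputs`, its `min_n` parameter, and the choice of `ValueError` for non-finite scores are assumptions made for illustration, not the toolkit's actual validation code.

```python
import numpy as np


def check_inputs(y_true, y_score, min_n=10):
    """Hypothetical guard; not the toolkit's actual validation code."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    if y_true.size < min_n:
        raise ValueError(f"need at least {min_n} samples, got {y_true.size}")
    if not np.isfinite(y_score).all():
        raise ValueError("y_score contains NaN or Inf")
    return y_true, y_score


# One input that is too short, one that contains a NaN.
for bad_scores in (np.full(5, 0.5), np.array([0.1] * 11 + [np.nan])):
    labels = np.zeros(bad_scores.size, dtype=int)
    try:
        check_inputs(labels, bad_scores)
    except ValueError as exc:
        print(f"rejected: {exc}")
```

Failing loudly at the boundary keeps a stray NaN from silently propagating into every bootstrap resample.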

## See also

- [`metrics.py` reference](../api/metrics.md) — full list of available
  metrics (PR-AUC, ROC-AUC, Brier, ECE variants, `headline_metrics`
  bundle).
- [`bootstrap.py` reference](../api/bootstrap.md) — `paired_bootstrap_diff`
  for two-scorer comparisons, `cv_clt_ci` for cross-validated CIs.
- [Evaluate harness example](evaluate_harness.md) — same metrics applied
  via the slice-aware orchestrator.