Worked example: metrics + bootstrap CIs

What this shows. Compute pr_auc / roc_auc / brier_score on a synthetic binary-classification fixture, then attach a 95% bootstrap CI via bootstrap_ci. The minimal entry point into the toolkit.

Runtime: ~1 s on a laptop. Pure numpy/scipy/sklearn core — no optional dependencies.

Setup

import numpy as np
from eval_toolkit import (
    pr_auc, roc_auc, brier_score,
    bootstrap_ci, set_global_seeds,
)
set_global_seeds(42)

Synthetic data: 200-row balanced binary classifier

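Toy ground-truth labels and scores from a discriminative-but-noisy model. Positives score 0.3 higher than negatives on average (class means 0.65 vs 0.35), with Gaussian noise of standard deviation 0.2 added and the result clipped to [0, 1].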

rng = np.random.default_rng(42)
n = 200
y_true = np.concatenate([np.zeros(100), np.ones(100)]).astype(int)
rng.shuffle(y_true)
y_score = np.clip(
    0.5 + 0.3 * (y_true - 0.5) + rng.normal(0, 0.2, size=n),
    0.0, 1.0,
)
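
As a rough check on what this construction implies, the expected ROC AUC and Brier score can be worked out in closed form from the class means and the noise level. The sketch below is not part of eval_toolkit; it ignores the clipping to [0, 1] and uses only the standard library. The names mean_gap, noise_sd and normal_cdf are illustrative.

```python
from math import erf, sqrt

# Values read off the fixture: class means 0.35 / 0.65, noise sd 0.2.
mean_gap = 0.3
noise_sd = 0.2


def normal_cdf(x: float) -> float:
    """Standard normal CDF expressed through the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


# AUC = P(positive score > negative score). The difference of two independent
# normal draws with sd noise_sd is normal with mean mean_gap and sd noise_sd * sqrt(2).
expected_auc = normal_cdf(mean_gap / (noise_sd * sqrt(2.0)))

# Brier = E[(score - label)^2] = (class mean minus its label)^2 + noise variance.
# The two classes are mirror images, so one term covers both.
expected_brier = (1.0 - 0.65) ** 2 + noise_sd ** 2

print(f"expected roc_auc ~ {expected_auc:.3f}")
print(f"expected brier   ~ {expected_brier:.4f}")
```

Under these assumptions the AUC lands near 0.86 and the Brier score near 0.16. Clipping can only move a score closer to its label, so the realized Brier score should come out slightly lower.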

Point estimates: pr_auc, roc_auc, brier_score

Each is a single function call returning a float:

ap = pr_auc(y_true, y_score)
auc = roc_auc(y_true, y_score)
bs = brier_score(y_true, y_score)
assert 0.0 <= ap <= 1.0
assert 0.0 <= auc <= 1.0
assert 0.0 <= bs <= 1.0
print(f"pr_auc={ap:.3f}  roc_auc={auc:.3f}  brier={bs:.3f}")
pr_auc=0.857  roc_auc=0.863  brier=0.158

The signal-to-noise here gives ~0.85 AUC / ~0.85 AP. Brier comes out at ~0.16: each class mean sits 0.35 away from its label, and the noise adds to that. For reference, a constant score of 0.5 would give a Brier score of 0.25.
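
For intuition about what two of these numbers measure, here is a minimal numpy sketch of the textbook definitions. It is not eval_toolkit's implementation, and the helper names rank_auc and mean_squared_brier are hypothetical. ROC AUC is computed as the fraction of (positive, negative) pairs the scores order correctly, and the Brier score as the mean squared gap between score and label.

```python
import numpy as np


def rank_auc(y_true, y_score):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Tied pairs count as half. Memory grows with n_pos * n_neg, so this is
    only meant for small illustrative inputs.
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))


def mean_squared_brier(y_true, y_score):
    """Brier score: mean squared difference between score and 0/1 label."""
    y_true = np.asarray(y_true, dtype=float)
    y_score = np.asarray(y_score, dtype=float)
    return float(np.mean((y_score - y_true) ** 2))


# Four hand-checkable rows: 3 of the 4 positive/negative pairs are ordered correctly.
y_small = [0, 0, 1, 1]
s_small = [0.1, 0.4, 0.35, 0.8]
print(f"auc={rank_auc(y_small, s_small):.3f}  brier={mean_squared_brier(y_small, s_small):.3f}")
```

The pairwise form makes it clear why ROC AUC depends only on the ordering of the scores, while the Brier score also depends on their actual values.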

95% bootstrap CI

bootstrap_ci wraps scipy.stats.bootstrap with BCa as the default method and produces a BootstrapCI dataclass with point_estimate, ci_low, ci_high:

ci_ap = bootstrap_ci(y_true, y_score, metric=pr_auc, n_resamples=200, seed=42)
print(f"pr_auc = {ci_ap.point_estimate:.3f}  [95% CI: {ci_ap.ci_low:.3f}, {ci_ap.ci_high:.3f}]")
assert ci_ap.ci_low <= ci_ap.point_estimate <= ci_ap.ci_high
assert ci_ap.confidence == 0.95
assert ci_ap.method == "BCa"
pr_auc = 0.857  [95% CI: 0.800, 0.918]
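
To see roughly what a wrapper of this kind delegates to, the sketch below calls scipy.stats.bootstrap directly. It assumes that labels and scores are resampled together as pairs (paired=True) and that the metric is a plain non-vectorized function; how bootstrap_ci actually configures the call is not confirmed here. The metric rank_auc is a hypothetical stand-in so the example does not depend on eval_toolkit.

```python
import numpy as np
from scipy import stats


def rank_auc(y_true, y_score):
    """Pairwise ROC AUC; ties count half.

    SciPy may hand the labels over as floats, so compare against 1 and 0
    by value rather than relying on an integer dtype.
    """
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return np.mean(diff > 0) + 0.5 * np.mean(diff == 0)


# Same fixture as in the worked example.
rng = np.random.default_rng(42)
y_true = np.concatenate([np.zeros(100), np.ones(100)]).astype(int)
rng.shuffle(y_true)
y_score = np.clip(0.5 + 0.3 * (y_true - 0.5) + rng.normal(0, 0.2, size=200), 0.0, 1.0)

res = stats.bootstrap(
    (y_true, y_score),
    rank_auc,
    paired=True,          # resample rows, keeping each label with its score
    vectorized=False,     # the metric takes 1-D arrays and returns a scalar
    n_resamples=200,
    confidence_level=0.95,
    method="BCa",
    random_state=np.random.default_rng(42),
)
low = res.confidence_interval.low
high = res.confidence_interval.high
print(f"roc_auc = {rank_auc(y_true, y_score):.3f}  [95% CI: {low:.3f}, {high:.3f}]")
```

With only 200 resamples the interval endpoints are themselves noisy; a larger n_resamples gives more stable bounds at proportionally higher cost.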

When to use percentile instead of BCa

BCa is the default. Fall back to method="percentile" for very small samples where BCa’s jackknife step can degenerate:

ci_perc = bootstrap_ci(
    y_true, y_score, metric=pr_auc,
    n_resamples=200, method="percentile", seed=42,
)
print(f"pr_auc (percentile) = [{ci_perc.ci_low:.3f}, {ci_perc.ci_high:.3f}]")
pr_auc (percentile) = [0.807, 0.923]

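The percentile CI reads the 2.5th and 97.5th percentiles straight off the bootstrap distribution, so it is not necessarily symmetric around the point estimate (here it extends 0.050 below and 0.066 above 0.857). BCa shifts those percentile levels using a bias-correction term (z0-hat) and an acceleration term (a-hat, estimated by jackknife), which matters most when the bootstrap distribution is skewed. For most well-conditioned cases the two methods agree to within ~0.01.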
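
The percentile method is simple enough to write out by hand. The following numpy-only sketch assumes paired resampling of rows and uses a hypothetical brier helper as the metric. It is meant to show the mechanics, not to reproduce the numbers printed by bootstrap_ci.

```python
import numpy as np


def brier(y_true, y_score):
    """Mean squared difference between score and 0/1 label."""
    return float(np.mean((y_score - y_true) ** 2))


# Same fixture as in the worked example.
rng = np.random.default_rng(42)
n = 200
y_true = np.concatenate([np.zeros(100), np.ones(100)]).astype(int)
rng.shuffle(y_true)
y_score = np.clip(0.5 + 0.3 * (y_true - 0.5) + rng.normal(0, 0.2, size=n), 0.0, 1.0)

# Draw row indices with replacement so each label stays attached to its score.
n_resamples = 2000
boot = np.empty(n_resamples)
for b in range(n_resamples):
    idx = rng.integers(0, n, size=n)
    boot[b] = brier(y_true[idx], y_score[idx])

point = brier(y_true, y_score)
# The percentile interval is just two quantiles of the resampled statistics.
low, high = np.percentile(boot, [2.5, 97.5])
print(f"brier = {point:.3f}  [95% percentile CI: {low:.3f}, {high:.3f}]")
print(f"extends {point - low:.3f} below and {high - point:.3f} above the point estimate")
```

Because the bounds are quantiles of the resampled values, the two distances printed on the last line are equal only when the bootstrap distribution happens to be symmetric about the point estimate.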

Pre-1.0 design note

bootstrap_ci rejects n < 10 with ValueError (too few samples for meaningful bootstrap variance). NaN/Inf scores are rejected by the underlying metrics — see NaN/Inf rejection tests for the full input-validation contract.
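
The kind of guard described above could look like the following sketch. The helper check_bootstrap_inputs, its error messages, and the shape check are hypothetical and not taken from eval_toolkit. In the toolkit the non-finite check lives in the metrics rather than in bootstrap_ci; the two are combined here only to keep the example short.

```python
import numpy as np

MIN_SAMPLES = 10  # threshold taken from the design note; the constant name is illustrative


def check_bootstrap_inputs(y_true, y_score):
    """Hypothetical validator: raise ValueError on inputs a bootstrap cannot use."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    if y_true.shape != y_score.shape:
        # Assumed extra check, not stated in the design note.
        raise ValueError("y_true and y_score must have the same shape")
    if y_true.size < MIN_SAMPLES:
        raise ValueError(f"need at least {MIN_SAMPLES} samples, got {y_true.size}")
    if not np.all(np.isfinite(y_score)):
        raise ValueError("y_score contains NaN or Inf")
    return y_true, y_score


# Two inputs that should be rejected: too few rows, and a NaN score.
too_small = ([0, 1, 0], [0.1, 0.9, 0.2])
has_nan = ([0, 1] * 6, [0.2, 0.8] * 5 + [float("nan"), 0.7])
for labels, scores in (too_small, has_nan):
    try:
        check_bootstrap_inputs(labels, scores)
    except ValueError as exc:
        print(f"rejected: {exc}")
```

Failing early with ValueError keeps a degenerate input from surfacing later as a NaN confidence interval that is harder to trace.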

See also