---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
kernelspec:
  display_name: Python 3
  language: python
  name: python3
---

# Worked example: paired bootstrap comparison

> **What this shows.** Compare two scorers on the same slice via
> `paired_bootstrap_diff`. The output's CI tells you whether one
> scorer is *statistically* better than the other on this slice,
> not just whether "the point estimate is higher."
>
> **Runtime:** ~2 s. Pure-numpy/scipy core; no optional deps.

## Why paired (not independent)

When two scorers are evaluated on the **same** `y_true` samples, their
scores are correlated. An unpaired comparison treats them as if they
came from independent data, which widens the CI and wastes statistical
power. `paired_bootstrap_diff` resamples *indices* shared by both
scorers (instead of resampling each scorer's array separately), which
preserves the within-sample correlation.

The classic illustration: if scorer B agrees with scorer A on 99 of 100
samples but differs on the one tricky sample, the paired test correctly
attributes the small difference to that sample; the unpaired test sees
two near-identical distributions and concludes "no signal."

## Setup

```{code-cell}
import numpy as np

from eval_toolkit import paired_bootstrap_diff, pr_auc, set_global_seeds

set_global_seeds(42)
```

## Synthetic data: two scorers on the same labels

Build a scenario where scorer B is *slightly* better than A (small but real signal).
The labels are shared; only the score arrays differ:

```{code-cell}
rng = np.random.default_rng(42)
n = 200

# Balanced labels in random order, shared by both scorers
y_true = np.concatenate([np.zeros(n // 2), np.ones(n // 2)]).astype(int)
rng.shuffle(y_true)

# Scorer A: discriminative but noisy
s_a = np.clip(0.5 + 0.3 * (y_true - 0.5) + rng.normal(0, 0.2, size=n), 0.0, 1.0)
# Scorer B: same shape but with stronger signal
s_b = np.clip(0.5 + 0.4 * (y_true - 0.5) + rng.normal(0, 0.2, size=n), 0.0, 1.0)

# Compute each metric once and reuse it for the point delta
auc_a = pr_auc(y_true, s_a)
auc_b = pr_auc(y_true, s_b)
print(f"pr_auc(A) = {auc_a:.3f}")
print(f"pr_auc(B) = {auc_b:.3f}")
print(f"point delta = {auc_b - auc_a:.3f}")
```

## Paired bootstrap CI on the difference

`paired_bootstrap_diff(y, s_a, s_b, metric=pr_auc, n_resamples=...)`
returns a `PairedBootstrapCI` with the **delta**'s CI bounds. If the CI
excludes zero, the difference is significant at the configured
confidence level (default 95%):

```{code-cell}
result = paired_bootstrap_diff(
    y_true,
    s_a,
    s_b,
    metric=pr_auc,
    n_resamples=500,
    seed=42,
)
print(f"delta = {result.delta:.3f} [95% CI: {result.ci_low:.3f}, {result.ci_high:.3f}]")

# Significant when the interval lies entirely on one side of zero
significant = result.ci_low > 0 or result.ci_high < 0
print(f"Significant at 95% confidence: {significant}")
```

## Reading the output

- **`delta` = `pr_auc(B) - pr_auc(A)`** is the point estimate (sign convention: positive means B is better)
- **`ci_low`, `ci_high`** are the BCa-style 95% CI bounds on the delta
- If `ci_low > 0`, B is significantly better (at α=0.05)
- If `ci_high < 0`, A is significantly better
- If the CI straddles zero, the difference is not significant, even if the point estimate suggests one direction

## MDE: how much delta could you have detected?
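The minimum detectable effect (MDE) is the smallest true delta that a comparison of this size would flag as significant with a stated power. For intuition only, the textbook normal approximation backs a standard error out of the CI half-width and scales it by the two relevant z-values. The helper name `normal_approx_mde` and the example bounds below are hypothetical, and this is not necessarily how `mde_from_ci` computes its value:

```python
from statistics import NormalDist


def normal_approx_mde(ci_low, ci_high, alpha=0.05, power=0.80):
    """Rough MDE from a two-sided (1 - alpha) CI, assuming a normal sampling
    distribution for the delta (an assumption made for this sketch only)."""
    z = NormalDist().inv_cdf
    half_width = (ci_high - ci_low) / 2
    # Recover the implied standard error from the CI half-width
    se = half_width / z(1 - alpha / 2)
    # Smallest true effect detected with the requested power
    return (z(1 - alpha / 2) + z(power)) * se


# Hypothetical CI bounds, chosen only to illustrate the arithmetic
mde = normal_approx_mde(-0.02, 0.06)
print(f"approximate MDE: {mde:.3f}")
```

Under this approximation the MDE at 80% power is always larger than the CI half-width, which is why a CI that barely straddles zero says little about small effects.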
`mde_from_ci` complements `paired_bootstrap_diff` by reporting the
**minimum detectable effect** given the CI width. This is useful when
the result is "no significant difference" and you want to claim "we
would have caught a delta of at least X if there were one":

```{code-cell}
from eval_toolkit import mde_from_ci

mde = mde_from_ci(result, alpha=0.05, power=0.80)
print(f"MDE at 80% power: {mde.mde:.3f}")
print(f"  (vs observed delta: {mde.delta_observed:.3f})")
```

If `mde.mde` is 0.05, you had power to detect a 0.05 (5 pp) improvement
in `pr_auc`; smaller real differences would have been underpowered.
Pair this with the CI from `paired_bootstrap_diff` to make claims like
"no significant difference, and we had power to detect ≥ X".

## Pre-1.0 design note

`paired_bootstrap_diff` defaults to BCa quantile arithmetic for the
delta's CI. Asymptotic normality is *not* assumed; the bootstrap handles
skewed distributions natively. For multi-comparison settings (e.g.,
testing scorer B vs A across 6 slices), apply Bonferroni / BH-FDR
correction to the resulting p-values; see issue
[#1](https://github.com/brandon-behring/eval-toolkit/issues/1).

## See also

- [`bootstrap.py` reference](../api/bootstrap.md): `paired_bootstrap_diff`, `paired_mde`, `mde_from_ci`, and `paired_bootstrap_op_point_diff` for operating-point comparisons.
- [`metrics.py` reference](../api/metrics.md): any metric with the `(y, s) -> float` signature works as the `metric` argument.
- [Metrics + bootstrap example](metrics_and_bootstrap.md): the single-scorer baseline.
- [Evaluate harness example](evaluate_harness.md): pass `paired_diffs=[("A", "B")]` to `evaluate(...)` to get this automatically per slice.
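
## Appendix: paired vs unpaired resampling in plain NumPy

The argument in "Why paired (not independent)" can be sketched without the toolkit. The block below is a minimal illustration, not the library's implementation: it uses a plain percentile interval rather than BCa, a simple thresholded accuracy as the metric, and synthetic scores with a shared per-sample noise term. The names `accuracy` and `bootstrap_delta_ci` are hypothetical helpers introduced here.

```python
import numpy as np


def accuracy(y, s):
    """Share of samples whose score, thresholded at 0.5, matches the label."""
    return float(np.mean((s >= 0.5) == (y == 1)))


def bootstrap_delta_ci(y, s_a, s_b, metric, paired, n_resamples=1000, seed=0):
    """Percentile 95% CI for metric(B) - metric(A) (not BCa)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    deltas = np.empty(n_resamples)
    for i in range(n_resamples):
        idx_a = rng.integers(0, n, size=n)
        # Paired: both scorers are evaluated on the same resampled rows.
        # Unpaired: scorer B gets its own independent draw of rows.
        idx_b = idx_a if paired else rng.integers(0, n, size=n)
        deltas[i] = metric(y[idx_b], s_b[idx_b]) - metric(y[idx_a], s_a[idx_a])
    lo, hi = np.quantile(deltas, [0.025, 0.975])
    return float(lo), float(hi)


rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)
# Per-sample difficulty shared by both scorers (assumed for this sketch)
shared = rng.normal(0, 0.2, size=n)
s_a = 0.5 + 0.3 * (y - 0.5) + shared + rng.normal(0, 0.02, size=n)
s_b = 0.5 + 0.4 * (y - 0.5) + shared + rng.normal(0, 0.02, size=n)

paired_lo, paired_hi = bootstrap_delta_ci(y, s_a, s_b, accuracy, paired=True)
unpaired_lo, unpaired_hi = bootstrap_delta_ci(y, s_a, s_b, accuracy, paired=False)
print(f"paired   CI width: {paired_hi - paired_lo:.3f}")
print(f"unpaired CI width: {unpaired_hi - unpaired_lo:.3f}")
```

When most of the noise is shared between the two scorers, as in this synthetic setup, the paired interval should come out clearly narrower, because resampling the same rows cancels the shared component out of the delta.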