# Comparison & confidence intervals

> **Background** *(skip if you've internalized this)*. A point estimate of
> a metric without a confidence interval is a vibe. Two-model comparisons
> that don't account for paired data (same eval rows, two models) lose
> power. "Model B is better than A" is a *statistical* claim that
> requires a CI on the difference, not just two separate CIs that happen
> to look apart. This chapter covers the bootstrap-based machinery
> eval-toolkit ships and when to reach for things it doesn't (McNemar,
> DeLong).

## Setup

```python
import numpy as np
from eval_toolkit import (
    bootstrap_ci,
    paired_bootstrap_diff,
    mde_from_ci,
    cv_clt_ci,
    pr_auc,
    roc_auc,
)
```

Two synthetic scorers: A (random) and B (informative).

```python
rng = np.random.default_rng(42)
y = rng.binomial(1, 0.3, size=300).astype(int)
s_a = rng.uniform(0, 1, size=300)
s_b = np.clip(0.6 * y + rng.normal(0, 0.25, size=300), 0, 1)
```

(comparison-bca-vs-percentile)=
## Single-condition CI: BCa vs percentile

A bootstrap CI for a metric resamples paired `(y_true, y_score)` indices
with replacement, computes the metric per resample, and reports a
quantile-based interval over the resampled distribution. eval-toolkit's
[`bootstrap_ci`](../api/bootstrap.md) supports three methods:

- **BCa** (bias-corrected, accelerated): the default. Adjusts the
  quantile cuts for skew in the resampled distribution. Most accurate
  for moderate sample sizes (n ≥ 30); the standard recommendation in
  Efron & Tibshirani (1993).
- **percentile**: naive quantile cut, no bias correction. Slightly
  optimistic on skewed distributions but the recommended fallback when
  BCa's jackknife step is degenerate (very small n, constant scores).
- **studentized**: bootstrap-t. Rare in ML eval; ships for completeness.
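
To make the resample-and-quantile procedure concrete, here is a minimal sketch of the percentile method in plain NumPy. It is not eval-toolkit's implementation: `percentile_bootstrap` and `accuracy_at_half` are hypothetical names used only for illustration, and the metric is a simple thresholded accuracy so the block needs no dependency beyond NumPy.

```python
import numpy as np


def accuracy_at_half(y_true, y_score):
    """Illustrative metric (assumption): accuracy after thresholding at 0.5."""
    return float(np.mean((y_score >= 0.5).astype(int) == y_true))


def percentile_bootstrap(y_true, y_score, metric, n_resamples=1000,
                         confidence=0.95, seed=0):
    """Percentile bootstrap CI: resample rows, recompute, take quantiles."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = np.empty(n_resamples)
    for b in range(n_resamples):
        # One index vector selects both label and score, so pairs stay intact.
        idx = rng.integers(0, n, size=n)
        stats[b] = metric(y_true[idx], y_score[idx])
    alpha = 1.0 - confidence
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    # The point estimate is the metric on the original data, not the resample mean.
    point = metric(y_true, y_score)
    return point, float(lo), float(hi)


# Synthetic data in the same spirit as the Setup section (separate names on purpose).
rng = np.random.default_rng(42)
y_demo = rng.binomial(1, 0.3, size=300)
s_demo = np.clip(0.6 * y_demo + rng.normal(0, 0.25, size=300), 0, 1)

point, lo, hi = percentile_bootstrap(y_demo, s_demo, accuracy_at_half)
print(f"accuracy={point:.3f} CI [{lo:.3f}, {hi:.3f}]")
```

BCa keeps the same resampling loop but shifts the two quantile levels using a bias term and a jackknife-based acceleration term, which is why it needs enough non-degenerate data to run.
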

```python
ci = bootstrap_ci(y, s_b, pr_auc, n_resamples=1000, method="BCa", seed=42)
print(f"BCa: PR-AUC={ci.point_estimate:.3f} CI [{ci.ci_low:.3f}, {ci.ci_high:.3f}]")

ci_p = bootstrap_ci(y, s_b, pr_auc, n_resamples=1000, method="percentile", seed=42)
print(f"pct: PR-AUC={ci_p.point_estimate:.3f} CI [{ci_p.ci_low:.3f}, {ci_p.ci_high:.3f}]")
```

> **What NOT to do.** Don't fall back to percentile by default. BCa is
> almost always the right choice for ML eval; reserve percentile for
> the explicit fallback case (`n < 30`, constant scores, BCa raises).

(comparison-paired-bootstrap)=
## Paired bootstrap for two-model comparison

When comparing two models on the same eval set, the metric difference
$\Delta = M_B - M_A$ typically has *less variance* than a difference of
two independently estimated metrics, because much of the resampling
noise cancels: a row that's hard for A is usually hard for B too. Use
paired bootstrap to exploit this.

eval-toolkit's [`paired_bootstrap_diff`](../api/bootstrap.md) shares
resample indices across both scorers, so the same rows enter each
resample for both A and B.

```python
diff = paired_bootstrap_diff(
    y, s_a, s_b, pr_auc,
    n_resamples=1000, seed=42,
)
print(f"Δ PR-AUC: {diff.delta:.3f} CI [{diff.ci_low:.3f}, {diff.ci_high:.3f}]")
print(f" overlaps zero: {diff.overlaps_zero}")
```

The `overlaps_zero` field is the headline finding: when the CI doesn't
contain 0, you have evidence that B is better (or worse) than A at the
chosen confidence level. When it does, you don't.

> **Why paired CIs are tighter.** If you computed two unpaired CIs
> independently and just visually compared them, you would
> *under-reject*: non-overlapping CIs do imply a significant difference,
> but *overlapping* CIs do not imply non-significance, because each
> separate CI still carries the shared row-level noise that the paired
> difference cancels. Always compute the paired CI explicitly.

(ece-differences)=
## ECE differences

For ECE differences, use
[`paired_bootstrap_ece_diff`](../api/bootstrap.md), which threads
`n_bins` through the metric correctly.
ECE is a binned estimator, so the resample's bin assignments need to
match the condition's bin definitions:

```python
from eval_toolkit import expected_calibration_error_l2_debiased, paired_bootstrap_ece_diff

# Using two probability-score scorers in [0, 1].
diff_ece = paired_bootstrap_ece_diff(
    y, s_a, s_b,
    ece_fn=expected_calibration_error_l2_debiased,
    n_bins=10, n_resamples=1000, seed=42,
)
print(f"Δ ECE: {diff_ece.delta:.4f} CI [{diff_ece.ci_low:.4f}, {diff_ece.ci_high:.4f}]")
```

(two-level-bootstrap)=
## Operating-point differences (two-level bootstrap)

When the metric is *operating-point-dependent* (F1, precision at fixed
recall) and your deployment refits the threshold per data batch, use
the two-level paired bootstrap from
[thresholds.md §"When to refit threshold per resample"](thresholds.md#bootstrap-refit).
The interval is wider because it captures both metric variance AND
threshold-selection variance, which is the honest story when the
deployment isn't fixing the threshold once and for all.

(comparison-mde)=
## MDE: "we couldn't detect a difference" claims

A wide CI that overlaps zero is *not* evidence the two models perform
the same; it's evidence you don't have enough data to tell. Quantify
this with the **minimum detectable effect (MDE)**: the smallest true
difference that your bootstrap-CI procedure would have detected with
80 % power.

```python
mde = mde_from_ci(diff, alpha=0.05, power=0.80)
print(f"MDE @ 80 % power: {mde.mde:.4f}")
print(f"interpretation: differences smaller than ~{mde.mde:.3f} would not "
      "have shown up as significant with this n.")
```

If the MDE is comparable to or larger than the differences you care
about, you need more eval data; running a different statistical test
won't help.

(comparison-cv-ci)=
## CV-CI: confidence intervals from K-fold

For K-fold CV results, use [`cv_clt_ci`](../api/bootstrap.md): a
CLT-corrected confidence interval over per-fold metric values.
This is what [`evaluate_folded`](../api/harness.md) auto-computes and
stores in `RunResult.fold_summary`.

```python
fold_metrics = np.array([0.74, 0.76, 0.71, 0.78, 0.73])
ci = cv_clt_ci(fold_metrics, confidence=0.95)
print(f"CV mean: {ci.point_estimate:.3f} CI [{ci.ci_low:.3f}, {ci.ci_high:.3f}]")
```

The CLT correction (Bates et al. 2024) accounts for the fact that
per-fold metrics are *not* independent: they share training data.
Naive Student's-t CIs over fold metrics are anti-conservative.

(comparison-out-of-scope)=
## What's NOT in eval-toolkit (and why)

Two classical paired tests are *deliberately* out of scope:

- **McNemar's test.** Compares the proportion of *disagreements*
  between two binary classifiers: A right / B wrong vs A wrong / B
  right. Use when you have hard predictions, not probability scores.
  Build the `2×2` table with
  [`scipy.stats.contingency`](https://docs.scipy.org/doc/scipy/reference/stats.contingency.html)
  and run the McNemar test on its off-diagonal (discordant) counts.
- **DeLong's test.** Compares ROC-AUC between two scorers using the
  Mann-Whitney form's variance. Specific to ROC-AUC; doesn't generalize
  to PR-AUC or threshold metrics. Several Python implementations exist
  (e.g., `pyroc-utils`, manual
  [DeLong implementations](https://github.com/yandexdataschool/roc_comparison)
  on GitHub).

Neither pays rent in eval-toolkit because:

1. **Bootstrap covers the same ground.** `paired_bootstrap_diff` gives
   a CI on any metric difference; McNemar and DeLong are special cases
   for binary predictions and ROC-AUC respectively.
2. **They don't generalize.** DeLong is ROC-AUC-only; McNemar is
   hard-prediction-only. The toolkit's bootstrap framework is
   metric-agnostic.
3. **Multiple-testing correction.** When comparing K > 2 models, a
   bootstrap CI on every pair is straightforward to compute, whereas
   McNemar / DeLong mean a separate test per pair with explicit
   Bonferroni / FDR corrections. Pairwise bootstrap CIs are not exempt
   from multiplicity, though: adjust the confidence level if you make
   many comparisons.

If you need them anyway, both are fine to compute alongside
eval-toolkit; they'll generally agree with the bootstrap result on
informative data.
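
If you do want McNemar next to the bootstrap result, a minimal sketch of the exact (binomial) form is below. It assumes hard 0/1 predictions from both models on the same rows. `mcnemar_exact` is a hypothetical helper, not part of eval-toolkit or SciPy; the only library call it relies on is `scipy.stats.binomtest` (SciPy 1.7 or later), and the toy arrays are made up for illustration.

```python
import numpy as np
from scipy.stats import binomtest


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact McNemar test on the discordant pairs of two classifiers."""
    correct_a = pred_a == y_true
    correct_b = pred_b == y_true
    n_a_only = int(np.sum(correct_a & ~correct_b))  # A right, B wrong
    n_b_only = int(np.sum(~correct_a & correct_b))  # A wrong, B right
    n_discordant = n_a_only + n_b_only
    if n_discordant == 0:
        # The models never disagree on correctness: nothing to test.
        return n_a_only, n_b_only, 1.0
    # Under H0, each discordant pair is equally likely to favour A or B.
    p_value = binomtest(n_b_only, n_discordant, p=0.5).pvalue
    return n_a_only, n_b_only, float(p_value)


# Toy labels and hard predictions (illustrative only).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])
pred_a = np.array([0, 0, 1, 0, 1, 0, 0, 0, 1, 0])
pred_b = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])

n_a_only, n_b_only, p = mcnemar_exact(y_true, pred_a, pred_b)
print(f"A-only correct: {n_a_only}, B-only correct: {n_b_only}, p={p:.3f}")
```

Only the discordant rows carry information here, which is why the test says nothing about probability scores and why a bootstrap CI on an accuracy difference covers the same question more generally.
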

(comparison-pitfalls)=
## Pitfalls / Common mistakes

- **Comparing two unpaired CIs visually.** "B's CI starts above A's CI
  ceiling, so B is better." That direction holds (non-overlap implies
  significance), but the converse does not: overlap does NOT imply
  non-significance. Always compute the paired CI on the difference.
- **Treating "overlaps zero" as "no difference".** Wide CIs are about
  insufficient power, not equivalence. Report MDE alongside.
- **Bootstrapping accuracy on small slices.** BCa's jackknife step
  becomes degenerate when n is small (n < 30 is the toolkit's bright
  line). The toolkit emits an error in that regime; fall back to
  `method="percentile"` and document the choice.
- **Comparing ECE bootstrap CIs across runs with different n_bins.**
  As in [calibration.md](calibration.md), ECE depends on bin count;
  bootstrap inherits that sensitivity.
- **Reporting bootstrap mean instead of point estimate.** The toolkit's
  `BootstrapCI.point_estimate` is the metric on the *original* data,
  not the resample mean. Report `point_estimate ± CI`, not
  `resample_mean ± CI`.

## Putting it all together

Full A-vs-B comparison report:

```python
ci_a = bootstrap_ci(y, s_a, pr_auc, n_resamples=1000, seed=42)
ci_b = bootstrap_ci(y, s_b, pr_auc, n_resamples=1000, seed=42)
diff = paired_bootstrap_diff(y, s_a, s_b, pr_auc, n_resamples=1000, seed=42)
mde = mde_from_ci(diff, alpha=0.05, power=0.80)

print(f"A: PR-AUC={ci_a.point_estimate:.3f} CI [{ci_a.ci_low:.3f}, {ci_a.ci_high:.3f}]")
print(f"B: PR-AUC={ci_b.point_estimate:.3f} CI [{ci_b.ci_low:.3f}, {ci_b.ci_high:.3f}]")
print(f"Δ: {diff.delta:.3f} CI [{diff.ci_low:.3f}, {diff.ci_high:.3f}]"
      f" overlaps_zero={diff.overlaps_zero}")
print(f"MDE @ 80 % power: {mde.mde:.4f}")
```

## Further reading

- Efron, B. & Tibshirani, R. *An Introduction to the Bootstrap.*
  Chapman & Hall, 1993. The canonical reference; BCa derived in §14.
- DiCiccio, T. & Efron, B. *Bootstrap confidence intervals.*
  Statistical Science 11(3), 1996. Comparison of CI methods.
- Bates, S., Hastie, T., & Tibshirani, R. *Cross-validation: what does
  it estimate and how well does it do it?* JASA 2024. The basis for
  `cv_clt_ci`'s CLT correction.
- DeLong, E. R. et al. *Comparing the areas under two or more
  correlated ROC curves: a nonparametric approach.* Biometrics 44, 1988.

See also: [thresholds.md](thresholds.md) (two-level bootstrap),
[calibration.md](calibration.md) (ECE differences),
[testing.md](testing.md) (property-test patterns for invariants).