Comparison & confidence intervals#
Background (skip if you’ve internalized this). A point estimate of a metric without a confidence interval is a vibe. Two-model comparisons that don’t account for paired data (same eval rows, two models) lose power. “Model B is better than A” is a statistical claim that requires a CI on the difference, not just two separate CIs that happen to look apart. This chapter covers the bootstrap-based machinery eval-toolkit ships and when to reach for things it doesn’t (McNemar, DeLong).
Setup#
import numpy as np
from eval_toolkit import (
    bootstrap_ci, paired_bootstrap_diff, mde_from_ci, cv_clt_ci,
    pr_auc, roc_auc,
)
Two synthetic scorers: A (random) and B (informative).
rng = np.random.default_rng(42)
y = rng.binomial(1, 0.3, size=300).astype(int)
s_a = rng.uniform(0, 1, size=300)
s_b = np.clip(0.6 * y + rng.normal(0, 0.25, size=300), 0, 1)
Single-condition CI: BCa vs percentile#
A bootstrap CI for a metric resamples paired (y_true, y_score) indices with replacement, computes the metric per resample, and reports a quantile-based interval over the resampled distribution. eval-toolkit’s bootstrap_ci supports three methods:
BCa (bias-corrected, accelerated) — the default. Adjusts the quantile cuts for skew in the resampled distribution. Most accurate for moderate sample sizes (n ≥ 30); the standard recommendation in Efron & Tibshirani (1993).
percentile — naive quantile cut, no bias correction. Slightly optimistic on skewed distributions but the recommended fallback when BCa’s jackknife step is degenerate (very small n, constant scores).
studentized — bootstrap-t. Rare in ML eval; ships for completeness.
ci = bootstrap_ci(y, s_b, pr_auc, n_resamples=1000, method="BCa", seed=42)
print(f"BCa: PR-AUC={ci.point_estimate:.3f} CI [{ci.ci_low:.3f}, {ci.ci_high:.3f}]")
ci_p = bootstrap_ci(y, s_b, pr_auc, n_resamples=1000, method="percentile", seed=42)
print(f"pct: PR-AUC={ci_p.point_estimate:.3f} CI [{ci_p.ci_low:.3f}, {ci_p.ci_high:.3f}]")
What NOT to do. Don’t fall back to percentile by default. BCa is almost always the right choice for ML eval; reserve percentile for the explicit fallback case (n < 30, constant scores, BCa raises).
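The resample-and-quantile procedure described above, and the adjustment BCa layers on top of it, can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not eval-toolkit's implementation: `accuracy_at_half` and `bootstrap_intervals` are hypothetical names, the metric is a simple stand-in for PR-AUC, and the BCa step follows the textbook formulas (bias correction from the share of resamples below the point estimate, acceleration from a leave-one-out jackknife).

```python
from statistics import NormalDist

import numpy as np


def accuracy_at_half(y_true, y_score):
    """Stand-in metric (hypothetical): accuracy after thresholding at 0.5."""
    return float(np.mean((y_score >= 0.5) == (y_true == 1)))


def bootstrap_intervals(y_true, y_score, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Return (point_estimate, percentile_ci, bca_ci); bca_ci is None if degenerate."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    point = metric(y_true, y_score)

    # Resample row indices with replacement so labels and scores stay paired.
    idx = rng.integers(0, n, size=(n_resamples, n))
    boot = np.array([metric(y_true[i], y_score[i]) for i in idx])

    # Percentile method: plain quantile cut of the resampled distribution.
    pct = tuple(float(q) for q in np.quantile(boot, [alpha / 2, 1 - alpha / 2]))

    # BCa bias correction z0: where the point estimate sits among the resamples.
    nd = NormalDist()
    share = float(np.mean(boot < point))
    share = min(max(share, 1 / (n_resamples + 1)), n_resamples / (n_resamples + 1))
    z0 = nd.inv_cdf(share)

    # BCa acceleration: skew of the leave-one-out (jackknife) estimates.
    jack = np.array([metric(np.delete(y_true, i), np.delete(y_score, i)) for i in range(n)])
    dev = jack.mean() - jack
    denom = 6.0 * np.sum(dev**2) ** 1.5
    if denom == 0.0:
        return point, pct, None  # jackknife is degenerate: fall back to percentile
    accel = float(np.sum(dev**3) / denom)

    def adjusted(q):
        z = nd.inv_cdf(q)
        return nd.cdf(z0 + (z0 + z) / (1 - accel * (z0 + z)))

    cuts = [adjusted(alpha / 2), adjusted(1 - alpha / 2)]
    bca = tuple(float(q) for q in np.quantile(boot, cuts))
    return point, pct, bca


rng = np.random.default_rng(42)
y = rng.binomial(1, 0.3, size=300)
s_b = np.clip(0.6 * y + rng.normal(0, 0.25, size=300), 0, 1)

point, pct, bca = bootstrap_intervals(y, s_b, accuracy_at_half)
print(f"point={point:.3f} percentile={pct} BCa={bca}")

# A metric that never varies makes the jackknife degenerate, so bca_flat is None.
_, _, bca_flat = bootstrap_intervals(np.ones(40, dtype=int), np.full(40, 0.9), accuracy_at_half)
print(bca_flat)
```

The last call shows the fallback case in miniature: when every leave-one-out estimate is identical, the acceleration term is 0/0 and only the percentile interval is available.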
Paired bootstrap for two-model comparison#
When comparing two models on the same eval set, the metric difference \(\Delta = M_B - M_A\) usually has less variance than a difference of two independently resampled metrics, because much of the resampling noise is shared and cancels: a row that’s hard for A tends to be hard for B too. Use the paired bootstrap to exploit this.
eval-toolkit’s paired_bootstrap_diff shares resample indices across both scorers: the same rows enter each resample for both A and B.
diff = paired_bootstrap_diff(
    y, s_a, s_b, pr_auc, n_resamples=1000, seed=42,
)
print(f"Δ PR-AUC: {diff.delta:.3f} CI [{diff.ci_low:.3f}, {diff.ci_high:.3f}]")
print(f" overlaps zero: {diff.overlaps_zero}")
The overlaps_zero field is the headline finding: when the CI doesn’t
contain 0, you have evidence that B is better (or worse) than A at the
chosen confidence level. When it does, you don’t.
Why paired CIs are tighter. If you computed two unpaired CIs independently and just compared them visually, you would under-reject: non-overlapping CIs imply a significant difference, but overlapping CIs do not imply non-significance. Unpaired CIs also throw away the row-level correlation between the two models, which is exactly what narrows the interval on the difference. Always compute the paired CI explicitly.
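The effect of sharing resample indices can be seen in a small NumPy sketch. It is an illustration only, not the toolkit's code: `diff_ci` and `accuracy_at_half` are hypothetical helpers, a percentile interval is used for brevity, and `s_c` is an assumed second scorer built as a noisy copy of B so that the two models are strongly correlated row by row.

```python
import numpy as np


def accuracy_at_half(y_true, y_score):
    """Stand-in metric (hypothetical): accuracy after thresholding at 0.5."""
    return float(np.mean((y_score >= 0.5) == (y_true == 1)))


def diff_ci(y, s_a, s_b, metric, paired, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile CI on metric(B) - metric(A), with or without shared indices."""
    rng = np.random.default_rng(seed)
    n = len(y)
    deltas = np.empty(n_resamples)
    for r in range(n_resamples):
        i = rng.integers(0, n, size=n)
        # Paired: B sees exactly the rows A saw. Unpaired: B gets its own draw.
        j = i if paired else rng.integers(0, n, size=n)
        deltas[r] = metric(y[j], s_b[j]) - metric(y[i], s_a[i])
    lo, hi = np.quantile(deltas, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


rng = np.random.default_rng(42)
y = rng.binomial(1, 0.3, size=300)
s_b = np.clip(0.6 * y + rng.normal(0, 0.25, size=300), 0, 1)
s_c = np.clip(s_b + rng.normal(0, 0.05, size=300), 0, 1)  # assumed near-copy of B

paired = diff_ci(y, s_b, s_c, accuracy_at_half, paired=True)
unpaired = diff_ci(y, s_b, s_c, accuracy_at_half, paired=False)
print(f"paired   width: {paired[1] - paired[0]:.4f}")
print(f"unpaired width: {unpaired[1] - unpaired[0]:.4f}")
```

With correlated scorers the paired interval comes out much narrower, because only the rows where the two models disagree contribute to the variance of the difference.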
ECE differences#
For ECE differences, use paired_bootstrap_ece_diff, which threads n_bins through the metric correctly. ECE is a binned estimator, so the resample’s bin assignments need to match the condition’s bin definitions:
from eval_toolkit import expected_calibration_error_l2_debiased, paired_bootstrap_ece_diff
# Using two probability-score scorers in [0, 1].
diff_ece = paired_bootstrap_ece_diff(
    y, s_a, s_b,
    ece_fn=expected_calibration_error_l2_debiased,
    n_bins=10, n_resamples=1000, seed=42,
)
print(f"Δ ECE: {diff_ece.delta:.4f} CI [{diff_ece.ci_low:.4f}, {diff_ece.ci_high:.4f}]")
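To make the bin-consistency point concrete, here is a sketch of a paired ECE difference where the bin edges are fixed once on [0, 1] and reused for every resample. It deliberately swaps in the plain equal-width L1 ECE rather than the toolkit's debiased L2 estimator, and `ece_equal_width` and `paired_ece_diff` are hypothetical names for illustration.

```python
import numpy as np


def ece_equal_width(y_true, y_prob, n_bins=10):
    """Plain L1 ECE with equal-width bins on [0, 1] (not the debiased L2 variant)."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, so indices run from 0 to n_bins - 1 and 1.0 lands in the last bin.
    bin_idx = np.digitize(y_prob, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_idx == b
        if in_bin.any():
            gap = abs(y_true[in_bin].mean() - y_prob[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by the share of rows in the bin
    return float(ece)


def paired_ece_diff(y, p_a, p_b, n_bins=10, n_resamples=500, alpha=0.05, seed=0):
    """Percentile CI on ECE(B) - ECE(A) with shared indices and a shared n_bins."""
    rng = np.random.default_rng(seed)
    n = len(y)
    deltas = np.empty(n_resamples)
    for r in range(n_resamples):
        i = rng.integers(0, n, size=n)
        deltas[r] = ece_equal_width(y[i], p_b[i], n_bins) - ece_equal_width(y[i], p_a[i], n_bins)
    lo, hi = np.quantile(deltas, [alpha / 2, 1 - alpha / 2])
    delta = ece_equal_width(y, p_b, n_bins) - ece_equal_width(y, p_a, n_bins)
    return delta, float(lo), float(hi)


rng = np.random.default_rng(42)
y = rng.binomial(1, 0.3, size=300)
s_a = rng.uniform(0, 1, size=300)
s_b = np.clip(0.6 * y + rng.normal(0, 0.25, size=300), 0, 1)

delta, lo, hi = paired_ece_diff(y, s_a, s_b, n_bins=10)
print(f"delta ECE (10 bins): {delta:.4f} CI [{lo:.4f}, {hi:.4f}]")

# One-bin sanity check: four rows all scored 0.9, one positive.
check = ece_equal_width(np.array([1, 0, 0, 0]), np.full(4, 0.9))
print(f"sanity check: {check:.2f}")
```

Because the edges depend only on `n_bins`, every resample and both scorers are binned identically; changing `n_bins` between runs changes the estimand and makes the resulting intervals incomparable.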
Operating-point differences (two-level bootstrap)#
When the metric is operating-point-dependent (F1, precision at fixed recall) and your deployment refits the threshold per data batch, use the two-level paired bootstrap from thresholds.md §“When to refit threshold per resample”.
The interval is wider — it captures both metric variance AND threshold-selection variance — which is the honest story when the deployment isn’t fixing the threshold once and forever.
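The idea of letting threshold selection vary inside the bootstrap can be sketched as follows. This is one plausible reading, offered as an assumption: the toolkit's actual procedure is defined in thresholds.md and may differ. Here each iteration draws one resample to choose F1-maximizing thresholds and a second, independent resample to score at those thresholds, sharing both draws across A and B. All function names are hypothetical.

```python
import numpy as np


def f1_at(y, s, thr):
    """F1 of the hard predictions s >= thr."""
    pred = s >= thr
    tp = np.sum(pred & (y == 1))
    fp = np.sum(pred & (y == 0))
    fn = np.sum(~pred & (y == 1))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0


def best_threshold(y, s, grid):
    """Grid point with the highest F1 on the given rows."""
    return grid[int(np.argmax([f1_at(y, s, t) for t in grid]))]


def refit_paired_f1_diff(y, s_a, s_b, n_resamples=200, alpha=0.05, seed=0):
    """Percentile CI on F1(B) - F1(A), refitting both thresholds every iteration."""
    rng = np.random.default_rng(seed)
    n = len(y)
    grid = np.linspace(0.05, 0.95, 19)
    deltas = np.empty(n_resamples)
    for r in range(n_resamples):
        fit = rng.integers(0, n, size=n)  # level 1: rows used to choose thresholds
        ev = rng.integers(0, n, size=n)   # level 2: rows used to score at them
        t_a = best_threshold(y[fit], s_a[fit], grid)
        t_b = best_threshold(y[fit], s_b[fit], grid)
        deltas[r] = f1_at(y[ev], s_b[ev], t_b) - f1_at(y[ev], s_a[ev], t_a)
    lo, hi = np.quantile(deltas, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


rng = np.random.default_rng(42)
y = rng.binomial(1, 0.3, size=300)
s_a = rng.uniform(0, 1, size=300)
s_b = np.clip(0.6 * y + rng.normal(0, 0.25, size=300), 0, 1)

lo, hi = refit_paired_f1_diff(y, s_a, s_b)
print(f"delta F1 with refit thresholds: CI [{lo:.3f}, {hi:.3f}]")
```

Fixing `t_a` and `t_b` once outside the loop would give the narrower fixed-threshold interval; the gap between the two is the threshold-selection variance.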
MDE: “we couldn’t detect a difference” claims#
A wide CI that overlaps zero is not evidence the two models perform the same — it’s evidence you don’t have enough data to tell. Quantify this with the minimum detectable effect (MDE): the smallest true difference that your bootstrap-CI procedure would have detected with 80 % power.
mde = mde_from_ci(diff, alpha=0.05, power=0.80)
print(f"MDE @ 80 % power: {mde.mde:.4f}")
print(f"interpretation: differences smaller than ~{mde.mde:.3f} would not "
      "have shown up as significant with this n.")
If the MDE is comparable to or larger than the differences you care about, you need more eval data — running a different statistical test won’t help.
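For intuition about where such a number comes from, the standard normal-approximation formula converts a CI half-width into an MDE. This is a sketch of that textbook formula, not necessarily what mde_from_ci computes internally; `mde_normal_approx` is a hypothetical helper and the interval endpoints are made-up inputs. It assumes the CI was built at the same 1 - alpha level.

```python
from statistics import NormalDist


def mde_normal_approx(ci_low, ci_high, alpha=0.05, power=0.80):
    """MDE = (z_{1-alpha/2} + z_{power}) * SE, with SE backed out of the CI width."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    z_power = nd.inv_cdf(power)
    se = (ci_high - ci_low) / (2 * z_alpha)  # half-width divided by the critical value
    return (z_alpha + z_power) * se


# Illustrative interval on a difference that overlaps zero.
mde = mde_normal_approx(-0.02, 0.06)
print(f"MDE at 80% power: {mde:.4f}")
```

At alpha = 0.05 and 80% power the multiplier is about 2.8 standard errors, or roughly 1.43 times the CI half-width, so an interval of ±0.04 can only reliably detect true differences near 0.057 or larger.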
CV-CI: confidence intervals from K-fold#
For K-fold CV results, use cv_clt_ci: a CLT-corrected confidence interval over per-fold metric values. This is what evaluate_folded auto-computes and stores in RunResult.fold_summary.
fold_metrics = np.array([0.74, 0.76, 0.71, 0.78, 0.73])
ci = cv_clt_ci(fold_metrics, confidence=0.95)
print(f"CV mean: {ci.point_estimate:.3f} CI [{ci.ci_low:.3f}, {ci.ci_high:.3f}]")
The CLT correction (Bates et al. 2024) accounts for the fact that per-fold metrics are not independent — they share training data. Naive Student’s-t CIs over fold metrics are anti-conservative.
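For comparison, the naive Student's-t interval that the paragraph above warns against is easy to compute by hand on the same fold values. This sketch shows only that baseline, which treats the folds as independent; it does not reproduce the correction inside cv_clt_ci. The critical value is hard-coded for five folds (4 degrees of freedom).

```python
import numpy as np

# Two-sided 95% Student's t critical value for 4 degrees of freedom (K = 5 folds).
T_CRIT_95_DF4 = 2.776

fold_metrics = np.array([0.74, 0.76, 0.71, 0.78, 0.73])
k = len(fold_metrics)

mean = float(fold_metrics.mean())
se = float(fold_metrics.std(ddof=1) / np.sqrt(k))  # treats folds as independent
naive_low = mean - T_CRIT_95_DF4 * se
naive_high = mean + T_CRIT_95_DF4 * se

print(f"naive t: mean={mean:.3f} CI [{naive_low:.3f}, {naive_high:.3f}]")
```

If the naive interval is anti-conservative as stated, a corrected interval on the same folds should be at least this wide; a narrower result would be worth investigating.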
What’s NOT in eval-toolkit (and why)#
Two classical paired tests are deliberately out of scope:
McNemar’s test. Compares the proportion of disagreements between two binary classifiers: A right / B wrong vs A wrong / B right. Use when you have hard predictions, not probability scores. Compute via scipy.stats.contingency and the McNemar 2×2 table.
DeLong’s test. Compares ROC-AUC between two scorers using the Mann-Whitney form’s variance. Specific to ROC-AUC; doesn’t generalize to PR-AUC or threshold metrics. Several Python implementations exist (e.g., pyroc-utils, manual DeLong implementations on GitHub).
Neither pays rent in eval-toolkit because:
Bootstrap covers the same ground. paired_bootstrap_diff gives a CI on any metric difference; McNemar and DeLong are special cases for binary predictions and ROC-AUC respectively.
They don’t generalize. DeLong is ROC-AUC-only; McNemar is hard-prediction-only. The toolkit’s bootstrap framework is metric-agnostic.
Multiple-testing correction. When comparing K > 2 models, a bootstrap CI on every pair is straightforward to compute; McNemar / DeLong p-values require explicit Bonferroni / FDR corrections. (Pairwise CIs are still subject to multiplicity if you read them as K(K−1)/2 simultaneous tests.)
If you need them anyway, both are fine to compute alongside eval-toolkit — they’ll generally agree with the bootstrap result on informative data.
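If you do want McNemar alongside the bootstrap, the exact version needs only the two discordant counts. The sketch below computes it with the standard library plus NumPy; `mcnemar_exact` is a hypothetical helper, and the predictions are a constructed toy example rather than output from the scorers above.

```python
from math import comb

import numpy as np


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on the discordant pairs of two hard predictors."""
    a_ok = pred_a == y_true
    b_ok = pred_b == y_true
    only_a = int(np.sum(a_ok & ~b_ok))  # A right, B wrong
    only_b = int(np.sum(~a_ok & b_ok))  # A wrong, B right
    n = only_a + only_b
    if n == 0:
        return only_a, only_b, 1.0  # the models never disagree
    # Under H0 each discordant pair favours A or B with probability 1/2.
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return only_a, only_b, min(1.0, 2 * tail)


# Toy data: 2 rows only A gets right, 10 rows only B gets right, 8 rows both get right.
y_true = np.ones(20, dtype=int)
pred_a = np.array([1] * 2 + [0] * 10 + [1] * 8)
pred_b = np.array([0] * 2 + [1] * 10 + [1] * 8)

only_a, only_b, p_value = mcnemar_exact(y_true, pred_a, pred_b)
print(f"A-only={only_a} B-only={only_b} p={p_value:.4f}")
```

Rows where both models are right or both are wrong never enter the statistic, which is why the test says nothing about score quality beyond the chosen threshold.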
Pitfalls / Common mistakes#
Comparing two unpaired CIs visually. “B’s CI starts above A’s CI ceiling, so B is better.” Mathematically: non-overlap implies significance, but overlap does NOT imply non-significance. Always compute the paired CI on the difference.
Treating “overlaps zero” as “no difference”. Wide CIs are about insufficient power, not equivalence. Report MDE alongside.
Bootstrapping accuracy on small slices. BCa’s jackknife step becomes degenerate when n is small (n < 30 is the toolkit’s bright line). The toolkit emits an error in that regime; fall back to method="percentile" and document the choice.
Comparing ECE bootstrap CIs across runs with different n_bins. As in calibration.md, ECE depends on bin count; bootstrap inherits that sensitivity.
Reporting bootstrap mean instead of point estimate. The toolkit’s BootstrapCI.point_estimate is the metric on the original data, not the resample mean. Report point_estimate ± CI, not resample_mean ± CI.
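The first pitfall is easy to verify with arithmetic. Assuming two independent estimates with the same standard error and normal-theory 95% intervals (a simplification for illustration), the gap needed for the two CIs to stop overlapping is noticeably larger than the gap needed for the difference itself to be significant:

```python
from math import sqrt
from statistics import NormalDist

z = NormalDist().inv_cdf(0.975)  # two-sided 95% critical value
se = 1.0                         # assumed common standard error of each estimate

# The two CIs separate only when the gap exceeds the sum of both half-widths.
gap_for_non_overlap = 2 * z * se
# The difference of independent estimates has SE sqrt(2) * se.
gap_for_significance = z * sqrt(2) * se

gap = 3.0  # an observed gap, in units of se
cis_overlap = gap < gap_for_non_overlap
diff_significant = gap > gap_for_significance

print(f"non-overlap needs gap > {gap_for_non_overlap:.2f} se")
print(f"significance needs gap > {gap_for_significance:.2f} se")
print(f"gap={gap}: CIs overlap={cis_overlap}, difference significant={diff_significant}")
```

Any gap between roughly 2.77 and 3.92 standard errors is significant even though the individual intervals overlap. With paired data and positively correlated models the standard error of the difference is smaller still, so the visual check is even more misleading.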
Putting it all together#
Full A-vs-B comparison report:
ci_a = bootstrap_ci(y, s_a, pr_auc, n_resamples=1000, seed=42)
ci_b = bootstrap_ci(y, s_b, pr_auc, n_resamples=1000, seed=42)
diff = paired_bootstrap_diff(y, s_a, s_b, pr_auc, n_resamples=1000, seed=42)
mde = mde_from_ci(diff, alpha=0.05, power=0.80)
print(f"A: PR-AUC={ci_a.point_estimate:.3f} CI [{ci_a.ci_low:.3f}, {ci_a.ci_high:.3f}]")
print(f"B: PR-AUC={ci_b.point_estimate:.3f} CI [{ci_b.ci_low:.3f}, {ci_b.ci_high:.3f}]")
print(f"Δ: {diff.delta:.3f} CI [{diff.ci_low:.3f}, {diff.ci_high:.3f}]"
      f" overlaps_zero={diff.overlaps_zero}")
print(f"MDE @ 80 % power: {mde.mde:.4f}")
Further reading#
Efron, B. & Tibshirani, R. An Introduction to the Bootstrap. Chapman & Hall, 1993 — the canonical reference; BCa derived in §14.
DiCiccio, T. & Efron, B. Bootstrap confidence intervals. Statistical Science 11(3), 1996 — comparison of CI methods.
Bates, S., Hastie, T., & Tibshirani, R. Cross-validation: what does it estimate and how well does it do it? JASA 2024 — the basis for cv_clt_ci’s CLT correction.
DeLong, E. R. et al. Comparing the areas under two or more correlated ROC curves: a nonparametric approach. Biometrics 44, 1988.
See also: thresholds.md (two-level bootstrap), calibration.md (ECE differences), testing.md (property-test patterns for invariants).