Reading list

Canonical references for the methodology this toolkit operationalizes. Listed in rough order of operational relevance: the things you’ll find yourself citing in production-eval write-ups go first.

Core methodology

  • Kapoor, S. & Narayanan, A. Leakage and the reproducibility crisis in machine-learning-based science. Patterns 4(9), 2023. arXiv:2207.07048. The 8-type leakage taxonomy adopted by leakage.md. Documents leakage errors affecting 294 papers across 17 fields.

  • Efron, B. & Tibshirani, R. An Introduction to the Bootstrap. Chapman & Hall / CRC, 1993. §14 derives BCa; the canonical reference for comparison.md’s bootstrap CIs.

  • Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. On Calibration of Modern Neural Networks. ICML 2017. arXiv:1706.04599. Temperature scaling — the canonical post-hoc calibration method; see calibration.md.

  • Naeini, M. P., Cooper, G. F., & Hauskrecht, M. Obtaining Well Calibrated Probabilities Using Bayesian Binning. AAAI 2015. arXiv:1411.0760. ECE definition + the equal-mass-binning rationale used in expected_calibration_error_equal_mass.

  • Kumar, A., Liang, P., & Ma, T. Verified Uncertainty Calibration. NeurIPS 2019. arXiv:1909.10155. Debiased ECE estimators — the basis for expected_calibration_error_debiased and the L2 variant.
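
As a rough illustration of the equal-mass binning idea referenced above, the sketch below computes a binary ECE over quantile bins. It is not the toolkit's expected_calibration_error_equal_mass implementation: the function name ece_equal_mass, its signature, and the choice to bin on the positive-class probability are assumptions made for this example, and it uses only the standard library to stay self-contained.

```python
from typing import Sequence


def ece_equal_mass(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Binary ECE over equal-mass (quantile) bins.

    Hypothetical helper for illustration only. `probs` are predicted
    positive-class probabilities and `labels` are 0/1 outcomes.
    """
    if len(probs) != len(labels) or len(probs) == 0:
        raise ValueError("probs and labels must be non-empty and equal length")
    n = len(probs)
    n_bins = min(n_bins, n)  # guarantees every bin holds at least one sample

    # Sort by predicted probability so each bin is a contiguous slice
    # containing (almost) the same number of samples.
    order = sorted(range(n), key=lambda i: probs[i])

    ece = 0.0
    for b in range(n_bins):
        lo, hi = b * n // n_bins, (b + 1) * n // n_bins
        idx = order[lo:hi]
        mean_conf = sum(probs[i] for i in idx) / len(idx)
        frac_pos = sum(labels[i] for i in idx) / len(idx)
        # Weight each bin's calibration gap by its share of the data.
        ece += (len(idx) / n) * abs(frac_pos - mean_conf)
    return ece
```

Because every bin holds roughly the same number of samples, no bin's gap is estimated from a handful of points, which is the usual argument for equal-mass over equal-width binning. This plug-in estimate is still biased upward on finite samples; the debiased estimators of Kumar et al. address that.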

Threshold selection & decision rules

  • Lipton, Z., Elkan, C., & Naryanaswamy, B. Optimal thresholding of classifiers to maximize F1 measure. ECML PKDD 2014. arXiv:1402.1892. Optimality proof for MaxF1Selector.

  • Elkan, C. The foundations of cost-sensitive learning. IJCAI 2001. Bayes-optimal threshold derivation used by CostSensitiveSelector.

  • Youden, W. J. Index for rating diagnostic tests. Cancer 3(1), 1950. Original Youden’s J statistic (YoudenJSelector).
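
To make two of the decision rules above concrete, here is a minimal sketch of Elkan's cost-based threshold and a brute-force Youden's J search. The function names and signatures are hypothetical and are not the APIs of CostSensitiveSelector or YoudenJSelector; the cost formula assumes calibrated probabilities and a cost matrix in which each error costs more than the corresponding correct decision.

```python
from typing import Sequence, Tuple


def elkan_threshold(
    cost_fp: float, cost_fn: float, cost_tp: float = 0.0, cost_tn: float = 0.0
) -> float:
    """Probability threshold p* from a 2x2 cost matrix (Elkan 2001).

    Predict positive when P(y=1 | x) >= p*, where
    p* = (c_FP - c_TN) / ((c_FP - c_TN) + (c_FN - c_TP)).
    """
    if cost_fp <= cost_tn or cost_fn <= cost_tp:
        raise ValueError("each error must cost more than the correct decision")
    return (cost_fp - cost_tn) / ((cost_fp - cost_tn) + (cost_fn - cost_tp))


def youden_j_threshold(
    scores: Sequence[float], labels: Sequence[int]
) -> Tuple[float, float]:
    """Return (threshold, J) maximizing J = TPR - FPR.

    A sample is predicted positive when score >= threshold. The search is
    O(n^2) for clarity; a sorted sweep would be used in practice.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes are required")

    best_t, best_j = None, float("-inf")
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        j = tp / n_pos - fp / n_neg  # sensitivity + specificity - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j
```

For example, when a false negative costs nine times as much as a false positive, elkan_threshold(1, 9) gives a threshold of 0.1, so the classifier flags anything with at least a 10% predicted probability of being positive.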

Splits & cross-validation

  • Bates, S., Hastie, T., & Tibshirani, R. Cross-validation: what does it estimate and how well does it do it? JASA 2024. The CLT-corrected K-fold CI underlying cv_clt_ci.

  • Hastie, T., Tibshirani, R., & Friedman, J. The Elements of Statistical Learning. §7.10. Cross-validation methodology canon.

  • Yan, X. et al. Hidden Leaks in Time Series Forecasting. arXiv, 2025. arXiv:2512.06932. Validation-strategy leakage in time-series settings; relevant to TimeSeriesSplitter and TemporalLeakageCheck.

  • Pellizzoni, S. et al. Don’t push the button! Data leakage risks in ML and transfer learning. Springer AI Review, 2025. DOI. Modern leakage taxonomy extending Kapoor & Narayanan; introduces transfer-learning leakage as an explicit class.

  • Recht, B., Roelofs, R., Schmidt, L., & Shankar, V. Do ImageNet classifiers generalize to ImageNet? ICML 2019. Empirical CV-vs-final-holdout divergence: the case study for why CV alone isn’t an OOD claim.

Reproducibility

Prompt-injection eval (consumer-relevant)

Eval harness ecosystem

Fairness

  • Hardt, M., Price, E., & Srebro, N. Equality of Opportunity in Supervised Learning. NeurIPS 2016. Defines equalized odds; see fairness.md §“Equalized odds”.

  • Kleinberg, J., Mullainathan, S., & Raghavan, M. Inherent Trade-offs in the Fair Determination of Risk Scores. ITCS 2017. arXiv:1609.05807. Incompatibility of fairness criteria.

  • Mitchell, M. et al. Model Cards for Model Reporting. FAccT 2019. Documentation pattern that consumes per-subgroup metrics.
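
Equalized odds, as defined by Hardt et al., asks that true-positive and false-positive rates match across groups. The sketch below measures how far a set of hard predictions is from that condition. It is an illustration only: equalized_odds_gaps is a hypothetical name, the max-minus-min summary is one of several possible choices, and for real use the libraries this documentation points to (fairlearn, aequitas) are the better option.

```python
from collections import defaultdict
from typing import Hashable, Sequence, Tuple


def equalized_odds_gaps(
    y_true: Sequence[int], y_pred: Sequence[int], groups: Sequence[Hashable]
) -> Tuple[float, float]:
    """Return (tpr_gap, fpr_gap): max minus min of per-group TPR and FPR.

    Hypothetical helper for illustration. Both gaps are 0.0 when the
    predictions satisfy equalized odds exactly.
    """
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
    for y, p, g in zip(y_true, y_pred, groups):
        if y == 1:
            key = "tp" if p == 1 else "fn"
        else:
            key = "fp" if p == 1 else "tn"
        counts[g][key] += 1

    tprs, fprs = [], []
    for g, c in counts.items():
        pos, neg = c["tp"] + c["fn"], c["fp"] + c["tn"]
        if pos == 0 or neg == 0:
            # Rates are undefined for a group that lacks one of the classes.
            raise ValueError(f"group {g!r} lacks positives or negatives")
        tprs.append(c["tp"] / pos)
        fprs.append(c["fp"] / neg)
    return max(tprs) - min(tprs), max(fprs) - min(fprs)
```

Reporting the two gaps separately keeps visible which error type drives a disparity, which is the kind of per-subgroup detail a model card would consume.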

Statistical comparison (deferred from v0.7.0)

  • DeLong, E. R., DeLong, D. M., & Clarke-Pearson, D. L. Comparing the areas under two or more correlated ROC curves: a nonparametric approach. Biometrics 44, 1988. Out-of-scope alternative to bootstrap CI on ROC-AUC differences; use scipy.stats + a manual implementation if required.

  • DiCiccio, T. J. & Efron, B. Bootstrap confidence intervals. Statistical Science 11(3), 1996. Comparison of CI methods.
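
For orientation, the bootstrap route that DeLong's test is an alternative to can be sketched as a paired resampling of the ROC-AUC difference between two models scored on the same examples. This is a simple percentile interval, not the BCa interval that comparison.md is described as using, and the names roc_auc and paired_bootstrap_auc_diff are hypothetical rather than toolkit APIs.

```python
import random
from typing import Sequence, Tuple


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as P(random positive outranks random negative); ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def paired_bootstrap_auc_diff(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    labels: Sequence[int],
    n_boot: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (point estimate, lower, upper) for AUC(a) - AUC(b)."""
    # Computing the point estimate first also validates that both classes exist.
    point = roc_auc(scores_a, labels) - roc_auc(scores_b, labels)
    rng = random.Random(seed)
    n = len(labels)

    diffs = []
    while len(diffs) < n_boot:
        # Resample example indices once and reuse them for both models,
        # which preserves the correlation between their scores.
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # AUC is undefined when a resample has one class
        auc_a = roc_auc([scores_a[i] for i in idx], ys)
        auc_b = roc_auc([scores_b[i] for i in idx], ys)
        diffs.append(auc_a - auc_b)

    diffs.sort()
    lower = diffs[int((alpha / 2) * (n_boot - 1))]
    upper = diffs[int((1 - alpha / 2) * (n_boot - 1))]
    return point, lower, upper
```

Resampling indices jointly is what makes the comparison paired; drawing separate resamples per model would ignore the shared test set and typically widen the interval.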

Future work for eval-toolkit

Items called out as out-of-scope in the v0.7.0 plan, with citations that motivate them:

  • Inline bootstrap CI on every metric. Inspect AI / lm-eval-harness pattern; appropriate for scorecard-oriented harnesses.

  • McNemar / DeLong as named functions. Currently consumers compute these on top of the bootstrap framework; see comparison.md §“What’s NOT in eval-toolkit”.

  • Full Croissant production by DatasetLoader. Currently the describe() output is a Croissant-compatible subset; full production requires JSON-LD generation and schema validation against the Croissant spec.

  • Native fairness metrics. Demographic parity, equalized odds, calibration parity. Pointers to fairlearn and aequitas instead; see fairness.md.

  • Property tests for new v0.7.0 modules, restoring the 90% coverage gate. Tracked for v0.7.1.