# Reading list

Canonical references for the methodology this toolkit operationalizes.
Listed in rough order of operational relevance: the things you'll find
yourself citing in production-eval write-ups go first.

## Core methodology

- **Kapoor, S. & Narayanan, A.** *Leakage and the reproducibility crisis in
  machine-learning-based science.* Patterns 4(9), 2023.
  [arXiv:2207.07048](https://arxiv.org/abs/2207.07048).
  *The 8-type leakage taxonomy adopted by [leakage.md](leakage.md). 294 papers
  across 17 fields where leakage was the cause of non-replication.*
- **Efron, B. & Tibshirani, R.** *An Introduction to the Bootstrap.*
  Chapman & Hall / CRC, 1993.
  *§14 derives BCa; the canonical reference for
  [comparison.md](comparison.md)'s bootstrap CIs.*
- **Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q.** *On Calibration of
  Modern Neural Networks.* ICML 2017.
  [arXiv:1706.04599](https://arxiv.org/abs/1706.04599).
  *Temperature scaling, the canonical post-hoc calibration method; see
  [calibration.md](calibration.md).*
- **Naeini, M. P., Cooper, G. F., & Hauskrecht, M.** *Obtaining Well
  Calibrated Probabilities Using Bayesian Binning.* AAAI 2015.
  [arXiv:1411.0760](https://arxiv.org/abs/1411.0760).
  *ECE definition + the equal-mass-binning rationale used in
  `expected_calibration_error_equal_mass`.*
- **Kumar, A., Liang, P., & Ma, T.** *Verified Uncertainty Calibration.*
  NeurIPS 2019. [arXiv:1909.10155](https://arxiv.org/abs/1909.10155).
  *Debiased ECE estimators: the basis for
  `expected_calibration_error_debiased` and the L2 variant.*

## Threshold selection & decision rules

- **Lipton, Z., Elkan, C., & Naryanaswamy, B.** *Optimal thresholding of
  classifiers to maximize F1 measure.* ECML PKDD 2014.
  [arXiv:1402.1892](https://arxiv.org/abs/1402.1892).
  *Optimality proof for [`MaxF1Selector`](../api/thresholds.md).*
- **Elkan, C.** *The foundations of cost-sensitive learning.* IJCAI 2001.
  *Bayes-optimal threshold derivation used by
  [`CostSensitiveSelector`](../api/thresholds.md).*
- **Youden, W. J.** *Index for rating diagnostic tests.* Cancer 3(1), 1950.
  *Original Youden's J statistic
  ([`YoudenJSelector`](../api/thresholds.md)).*

## Splits & cross-validation

- **Bates, S., Hastie, T., & Tibshirani, R.** *Cross-validation: what does it
  estimate and how well does it do it?* JASA 2024.
  *The CLT-corrected K-fold CI underlying
  [`cv_clt_ci`](../api/bootstrap.md).*
- **Hastie, T., Tibshirani, R., & Friedman, J.** *The Elements of Statistical
  Learning.* §7.10.
  *Cross-validation methodology canon.*
- **Yan, X. et al.** *Hidden Leaks in Time Series Forecasting.* arXiv, 2025.
  [arXiv:2512.06932](https://arxiv.org/html/2512.06932v1).
  *Validation-strategy leakage in time-series settings; relevant to
  [`TimeSeriesSplitter`](../api/splits.md) and
  [`TemporalLeakageCheck`](../api/leakage.md).*
- **Pellizzoni, S. et al.** *Don't push the button! Data leakage risks in ML
  and transfer learning.* Springer AI Review, 2025.
  [DOI](https://link.springer.com/article/10.1007/s10462-025-11326-3).
  *Modern leakage taxonomy extending Kapoor & Narayanan; introduces
  transfer-learning leakage as an explicit class.*
- **Recht, B., Roelofs, R., Schmidt, L., & Shankar, V.** *Do ImageNet
  classifiers generalize to ImageNet?* ICML 2019.
  *Empirical CV-vs-final-holdout divergence: the case study for why CV alone
  isn't an OOD claim.*

## Reproducibility

- **NeurIPS Paper Checklist.**
  [neurips.cc/public/guides/PaperChecklist](https://neurips.cc/public/guides/PaperChecklist).
  *Manifest field alignment; see [reproducibility.md](reproducibility.md)
  §"NeurIPS mapping".*
- **PyTorch 2.8 reproducibility notes.**
  [docs.pytorch.org/docs/stable/notes/randomness.html](https://docs.pytorch.org/docs/stable/notes/randomness.html).
  *Canonical citation for the four sharp edges in
  [reproducibility.md](reproducibility.md) §"PyTorch determinism".*
- **Croissant: A Metadata Format for ML-Ready Datasets.** MLCommons, 2024.
  [arXiv:2403.19546](https://arxiv.org/abs/2403.19546).
  *Croissant-compatible metadata in
  [`DatasetLoader.describe()`](../api/loaders.md).*
- **Pineau, J. et al.** *Improving reproducibility in machine learning
  research.* JMLR 22, 2021.
  *Practical guide alongside the NeurIPS checklist.*

## Prompt-injection eval (consumer-relevant)

- **OWASP.** *LLM01:2025 Prompt Injection.*
  [genai.owasp.org](https://genai.owasp.org/llmrisk/llm01-prompt-injection/).
  *Slice taxonomy used in
  [`prompt_injection_walkthrough.md`](../examples/prompt_injection_walkthrough.md):
  direct, indirect, encoded/obfuscated, system-prompt-leak, multi-stage.*
- **PI_HackAPrompt_SQuAD analysis (2025).**
  [arXiv:2505.04806](https://arxiv.org/html/2505.04806v1).
  *21.3 % naive-dedup detection vs 76.2 % attack-success-rate finding
  motivating [`NormalizedFormLeakageCheck`](../api/leakage.md).*
- **DataSentinel + PromptLocate.**
  [arXiv:2511.15759](https://arxiv.org/abs/2511.15759).
  *Strict-normalization contamination checks for prompt-injection
  benchmarks.*
- **Lakera PINT benchmark.**
  [github.com/lakeraai/pint-benchmark](https://github.com/lakeraai/pint-benchmark).
  *Canonical detection benchmark; the dataset used by the worked example's
  "full walkthrough" cross-link.*
- **Open-Prompt-Injection (Liu et al.).**
  [github.com/liu00222/Open-Prompt-Injection](https://github.com/liu00222/Open-Prompt-Injection).
  *Reference dataset of attack prompts.*

## Eval harness ecosystem

- **EleutherAI lm-evaluation-harness.**
  [github.com/EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness).
  *Source of the per-task `VERSION` field pattern adopted by
  [`Versioned`](../api/leakage.md).*
- **UK AISI Inspect AI.**
  [inspect.aisi.org.uk](https://inspect.aisi.org.uk/).
  *Reference architecture for safety-eval harness Scorer/Solver separation;
  cross-link from [`extending.md`](../extending.md).*
- **Stanford HELM.**
  [github.com/stanford-crfm/helm](https://github.com/stanford-crfm/helm).
  *Reference for benchmark-schema-as-versioned-artifact pattern.*

## Fairness

- **Hardt, M., Price, E., & Srebro, N.** *Equality of Opportunity in
  Supervised Learning.* NeurIPS 2016.
  *Equalized odds; see [fairness.md](fairness.md) §"Equalized odds".*
- **Kleinberg, J., Mullainathan, S., & Raghavan, M.** *Inherent Trade-offs in
  the Fair Determination of Risk Scores.* ITCS 2017.
  [arXiv:1609.05807](https://arxiv.org/abs/1609.05807).
  *Incompatibility of fairness criteria.*
- **Mitchell, M. et al.** *Model Cards for Model Reporting.* FAccT 2019.
  *Documentation pattern that consumes per-subgroup metrics.*

## Statistical comparison (deferred from v0.7.0)

- **DeLong, E. R., DeLong, D. M., & Clarke-Pearson, D. L.** *Comparing the
  areas under two or more correlated ROC curves: a nonparametric approach.*
  Biometrics 44, 1988.
  *Out-of-scope alternative to bootstrap CI on ROC-AUC differences; use
  [`scipy.stats`](https://docs.scipy.org/doc/scipy/reference/stats.html) + a
  manual implementation if required.*
- **DiCiccio, T. J. & Efron, B.** *Bootstrap confidence intervals.*
  Statistical Science 11(3), 1996.
  *Comparison of CI methods.*

## Future work for eval-toolkit

Items called out as out-of-scope in the v0.7.0 plan, with citations that
motivate them:

- **Inline bootstrap CI on every metric.** Inspect AI / lm-eval-harness
  pattern; appropriate for scorecard-oriented harnesses.
- **McNemar / DeLong as named functions.** Currently consumers compute these
  on top of the bootstrap framework; see
  [comparison.md §"What's NOT in eval-toolkit"](comparison.md#out-of-scope).
- **Full Croissant production by `DatasetLoader`.** Currently the
  `describe()` output is a Croissant-compatible *subset*; full production
  requires JSON-LD generation and schema validation against the Croissant
  spec.
- **Native fairness metrics.** Demographic parity, equalized odds,
  calibration parity. Pointers to `fairlearn` and `aequitas` instead; see
  [fairness.md](fairness.md).
- **Property tests for new v0.7.0 modules**, restoring the 90 % coverage
  gate. Tracked for v0.7.1.

## Cross-link

The v0.3 research audit (`docs/v0.3_research_audit.md`) is the defensive
review of every public method: literature review, gap analysis, and a B+
industry rating against sklearn / scipy / statsmodels. Read it once when you
need to defend a methodological choice; the chapters in this directory are the
*prescriptive* counterpart to the audit's *descriptive* layer.