Roadmap
Forward-looking tracker for eval-toolkit. Cross-links the consumer
gap docs that motivate upstream work and records the criteria for
v1.0.0.
This document is descriptive of intent, not a commitment. The priorities reflect today’s understanding of what consumers need; order may change as feedback comes in.
Currently shipped (as of v0.36.0)
See CHANGELOG.md for the full release history.
Highlights since v0.33:
v0.34.0: Phase 4 stats unblockers (CV-aware block bootstrap, generalized mde_from_ci, multi-comparisons correction); unified internal parallel_map helper + n_jobs kwarg on all 5 public bootstrap functions; cookbook docs (3 compositional patterns). Closed the entire prior backlog in one release.
v0.35.0: fit_temperature_binary (scalar-proba adapter for binary calibration; closes #28); Scorer picklability ADR in methodology/parallelism.md (unblocks v0.36 harness parallelism).
v0.36.0: evaluate(n_jobs=) + evaluate_folded(n_jobs=) wire the unified parallelism pattern into the harness loop (closes #29, #30). CI actions bumped to Node 24 ahead of the 2026-06-02 deprecation.
State-of-the-toolkit:
5 Tier-2 Protocols (Scorer, LeakageCheck, Splitter, ThresholdSelector, DatasetLoader) + 1 opt-in (Versioned).
Reference impls: 6 selectors, 7 leakage checks (incl. NormalizedFormLeakageCheck for encoding-obfuscated dupes), 5 splitters (incl. SourceDisjointKFoldSplitter), 4 loaders.
RunManifest with NeurIPS Reproducibility Checklist alignment + Croissant-compatible loader metadata.
Versioned JSON schemas at src/eval_toolkit/schemas/.
Multi-file methodology curriculum: 16 chapters covering leakage, splits, thresholds, calibration, comparison, fairness, reproducibility, testing, bootstrap, text dedup, versioning, length stratification, artifacts, claims, evidence, parallelism.
Reference-equivalence tests against sklearn / scipy for the wrapped primitives (pr_auc, roc_auc, brier_score, reliability_curve, bootstrap_ci, fit_isotonic_calibrator, fit_platt_calibrator).
90 % global coverage gate; per-module breakdown in CI.
Sybil-validated doc-blocks across docs/source/methodology/, docs/source/extending.md, docs/source/migration/, docs/source/examples/, README.md.
Per-version migration guides (migration/v0.7.md, migration/v0.8.md, migration/v0.9.md) + general MIGRATION.md.
Consumer gap docs (input)
If you maintain a downstream consumer of eval-toolkit and have an
upstream wish, the convention is to put a docs/eval_toolkit_gaps.md
in your repo, open an issue or PR against eval-toolkit linking it,
and we’ll reconcile against the tracked-candidates list below.
Historical gap-closure status (Gaps 1–4 from the v0.7 era) is
preserved in CHANGELOG entries for v0.7.x / v0.8.0.
Tracked candidates (see GitHub Issues)
The previously-untracked “v0.9 candidates” list has been filed as GitHub Issues. Issue state is the source of truth; this section is a navigational gloss.
#35 (P2): TokenizationLeakageCheck (HF-tokenizer-aware dedup; complements NormalizedFormLeakageCheck).
#31 (P3): Migrate docs/source/examples/ from static MD to executable myst-nb cells.
#36 (P3): Inline bootstrap CI on every metric (Inspect-AI / lm-eval scorecard pattern).
#37 (P3): Restore per-module coverage floors (seeds.py 70 % due to optional torch path).
#38 (P3): CI doctests for paths.py / provenance.py / seeds.py / docs.py.
Run gh issue list -R brandon-behring/eval-toolkit --label P2 or
--label P3 for live state.
v1-prelude evidence core
The next stabilization step is the generic evidence layer now used by consumer migrations:
Validation-fit operating points can be applied to mixed-class, all-positive, or all-negative target slices with threshold provenance.
RunManifest can carry optional source-role records and guardrails.
Generic claim gates can fail on missing headline comparisons, inadequate slice sizes, scorer/leakage errors, missing source roles, and exceeded metric caps such as hard-negative FPR.
These stay library-first: no prompt-injection datasets, presets, CLI, or markdown report generator.
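The first and third points above can be illustrated with a small, library-independent sketch. It is not eval-toolkit's API: OperatingPoint, slice_rates, and fpr_cap_gate are hypothetical names, and the cap and minimum slice size are made-up values. It shows one way a threshold fit on validation data might be applied to an all-negative slice (where TPR is undefined) and then checked against an FPR cap.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class OperatingPoint:
    """Hypothetical record: a threshold plus a label saying where it was fit."""
    threshold: float
    fit_on: str  # provenance label, e.g. "validation"


def slice_rates(
    scores: Sequence[float], labels: Sequence[int], op: OperatingPoint
) -> Tuple[Optional[float], Optional[float]]:
    """Apply a pre-fit threshold to one slice and return (tpr, fpr).

    A rate is None when the slice has no examples of the class it is
    defined on, which is what happens on all-negative or all-positive slices.
    """
    preds = [int(s >= op.threshold) for s in scores]
    pos = sum(labels)
    neg = len(labels) - pos
    tp = sum(p for p, y in zip(preds, labels) if y == 1)
    fp = sum(p for p, y in zip(preds, labels) if y == 0)
    tpr = tp / pos if pos else None
    fpr = fp / neg if neg else None
    return tpr, fpr


def fpr_cap_gate(fpr: Optional[float], n: int, *, cap: float, min_n: int) -> List[str]:
    """Return failure reasons; an empty list means the gate passes."""
    failures = []
    if n < min_n:
        failures.append(f"slice too small: n={n} < {min_n}")
    if fpr is None:
        failures.append("FPR undefined: slice has no negatives")
    elif fpr > cap:
        failures.append(f"FPR {fpr:.3f} exceeds cap {cap:.3f}")
    return failures


# Illustrative data only: an all-negative "hard negative" slice.
op = OperatingPoint(threshold=0.5, fit_on="validation")
hard_neg_scores = [0.1, 0.7, 0.2, 0.4]
hard_neg_labels = [0, 0, 0, 0]

tpr, fpr = slice_rates(hard_neg_scores, hard_neg_labels, op)
failures = fpr_cap_gate(fpr, len(hard_neg_labels), cap=0.2, min_n=3)
print(op.fit_on, tpr, fpr, failures)
```

Returning None for an undefined rate, rather than 0.0 or NaN, keeps "no positives in this slice" distinguishable from "no positives were detected".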
v1.0.0 path (long-term, gated)
v1.0.0 signals API stability — breaking changes after v1.0 require v2.0. Gated on:
Real consumer running v0.7+ in production for ≥ 1 review cycle. The canonical consumer is prompt-injection-detection-submission. Other prompt_injection_* / prompt-injection-* repos in the author’s workspace are experiments, scaffolds, or earlier prototypes; only the detection-submission repo gates v1.0.
Protocol shapes survive ≥ 1 “should we change this?” review cycle. v0.7.x added 5 Tier-2 Protocols (Scorer, LeakageCheck, Splitter, ThresholdSelector, DatasetLoader) + 1 opt-in (Versioned). As of v0.41.0, all six have been stable across 34 minor releases (v0.7 → v0.41) except for one contract-tightening edit to LeakageCheck.name in v0.39.0 (#40), which changed the Protocol declaration from a settable class-level attribute to a @property to align with the @dataclass(frozen=True) implementation pattern. v0.40.0 and v0.41.0 both shipped without Protocol shape edits (v0.40: fit_platt_binary + fit_beta_binary additions; v0.41: HFDatasetsLoader enrichment; neither touched Tier-2 Protocols). The stability window is now 2 of 2 minors without Protocol edits as of v0.41.0: Gate 2 ✅ MET. The v1-prelude evidence APIs must also survive one real-consumer migration check (the prompt-injection-detection-submission repo).
Methodology docs peer-reviewed by an external reader (statistics / methodology background, ideally not part of the prompt_injection_* core team).
Croissant interop verified end-to-end: ✅ MET as of v0.41.0 (see tests/test_croissant_e2e.py). HFDatasetsLoader.describe() fetches Croissant metadata + per-file sha256 from HF Hub; the integration test downloads a real parquet shard from stanfordnlp/sst2 and verifies that the bytes hash exactly to the value describe() reports. Caveat: HF Hub’s Croissant emitter currently punts on distribution[].sha256 (per the still-open MLCommons Croissant spec issue #80), so HFDatasetsLoader reads sha256 from HF Hub’s tree API (lfs.oid) today. When #80 resolves and HF Hub starts populating Croissant sha256 with real values, the loader will pick up the new source automatically. See methodology/reproducibility.md §“Croissant interoperability” for the design.
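The hash check at the heart of the Croissant gate can be sketched with the standard library alone. This is not the code in tests/test_croissant_e2e.py: sha256_of_file and verify_download are hypothetical helpers, and the "shard" is a local stand-in file rather than a real download from HF Hub. It shows only the comparison between a file's bytes and a reported sha256.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large shards need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Path, expected_sha256: str) -> bool:
    """True when the file's bytes hash to the reported hex digest."""
    return sha256_of_file(path) == expected_sha256.lower()


with tempfile.TemporaryDirectory() as tmp:
    shard = Path(tmp) / "shard.parquet"
    payload = b"stand-in bytes, not a real parquet file"  # assumption: local fake
    shard.write_bytes(payload)

    # In the real test the expected value would come from loader metadata;
    # here it is computed directly from the payload for illustration.
    reported = hashlib.sha256(payload).hexdigest()

    ok = verify_download(shard, reported)
    tampered = verify_download(shard, "0" * 64)

print(ok, tampered)
```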
When v1.0 ships:
API surface freezes. Breaking changes require a v2.0 major bump.
The five Tier-2 Protocols become contracts (no method-shape changes, only additive subprotocols).
The JSON schemas (schemas/*.v1.json) become the canonical contract; any breaking change to a schema bumps to *.v2.json.
Out of scope (deliberately)
These are valuable but not on the roadmap:
Native fairness metrics (demographic parity, equalized odds, calibration parity). Consumer computes via fairlearn + the toolkit’s slicing primitives; eval-toolkit shouldn’t duplicate.
McNemar / DeLong tests. Consumer computes via scipy.stats; the toolkit’s bootstrap framework covers the same ground for arbitrary metrics.
Common metrics (MCC, Cohen’s kappa, balanced accuracy, log-loss). Design intent keeps the metric set focused on the four headline primitives + ECE family + threshold selection. Consumers add what they need.
CLI. The toolkit is a library; consumer projects build their own CLI (e.g., the prompt_injection_* repos’ evaluate.py scripts).
A formal plugin registry / setuptools entry-points system. The Protocol-based seam is sufficient.
Optional fit_platt_calibrator(canonical: bool = True) flag. v0.3.0 already canonicalized the impl per Platt 1999 §2.2; the flag would re-introduce the non-canonical variant for backward compatibility, but no consumer demand has surfaced. WONTFIX unless asked.
How to file an upstream wish
Add a section in your project’s docs/eval_toolkit_gaps.md describing the gap, severity, and a sketch of the patch (if known).
Open an issue or PR against eval-toolkit’s GitHub repo, linking that gaps doc.
If you’ve done the work locally, the PR can be a draft with the suggested patch + tests; we’ll reconcile against this roadmap.
See also
CHANGELOG.md: release history.
docs/MIGRATION.md: version-to-version migration guides.
docs/methodology/reading_list.md: citation-level “future work” pointers (statistical methods that could land if there’s appetite).