# Roadmap Forward-looking tracker for `eval-toolkit`. Cross-links the consumer gap docs that motivate upstream work and records the criteria for v1.0.0. This document is **descriptive of intent, not a commitment**. The priorities reflect today's understanding of what consumers need; order may change as feedback comes in. ## Currently shipped (as of v0.36.0) See [`CHANGELOG.md`](https://github.com/brandon-behring/eval-toolkit/blob/main/CHANGELOG.md) for the full release history. Highlights since v0.33: - **v0.34.0** — Phase 4 stats unblockers (CV-aware block bootstrap, generalized `mde_from_ci`, multi-comparisons correction); unified internal `parallel_map` helper + `n_jobs` kwarg on all 5 public bootstrap functions; cookbook docs (3 compositional patterns). Closed the entire prior backlog in one release. - **v0.35.0** — `fit_temperature_binary` (scalar-proba adapter for binary calibration; closes #28); Scorer picklability ADR in `methodology/parallelism.md` (unblocks v0.36 harness parallelism). - **v0.36.0** — `evaluate(n_jobs=)` + `evaluate_folded(n_jobs=)` wire the unified parallelism pattern into the harness loop (closes #29, #30). CI actions bumped to Node 24 ahead of the 2026-06-02 deprecation. State-of-the-toolkit: - 5 Tier-2 Protocols (`Scorer`, `LeakageCheck`, `Splitter`, `ThresholdSelector`, `DatasetLoader`) + 1 opt-in (`Versioned`). - Reference impls: 6 selectors, 7 leakage checks (incl. `NormalizedFormLeakageCheck` for encoding-obfuscated dupes), 5 splitters (incl. `SourceDisjointKFoldSplitter`), 4 loaders. - `RunManifest` with NeurIPS Reproducibility Checklist alignment + Croissant-compatible loader metadata. - Versioned JSON schemas at `src/eval_toolkit/schemas/`. - Multi-file methodology curriculum: [16 chapters](methodology/README.md) covering leakage, splits, thresholds, calibration, comparison, fairness, reproducibility, testing, bootstrap, text dedup, versioning, length stratification, artifacts, claims, evidence, parallelism. - Reference-equivalence tests against sklearn / scipy for the wrapped primitives (`pr_auc`, `roc_auc`, `brier_score`, `reliability_curve`, `bootstrap_ci`, `fit_isotonic_calibrator`, `fit_platt_calibrator`). - 90 % global coverage gate; per-module breakdown in CI. - Sybil-validated doc-blocks across `docs/source/methodology/`, `docs/source/extending.md`, `docs/source/migration/`, `docs/source/examples/`, `README.md`. - Per-version migration guides ([`migration/v0.7.md`](migration/v0.7.md), [`migration/v0.8.md`](migration/v0.8.md), [`migration/v0.9.md`](migration/v0.9.md)) + general [`MIGRATION.md`](MIGRATION.md). ## Consumer gap docs (input) If you maintain a downstream consumer of eval-toolkit and have an upstream wish, the convention is to put a `docs/eval_toolkit_gaps.md` in your repo, open an issue or PR against `eval-toolkit` linking it, and we'll reconcile against the tracked-candidates list below. Historical gap-closure status (Gaps 1–4 from the v0.7 era) is preserved in CHANGELOG entries for v0.7.x / v0.8.0. ## Tracked candidates (see GitHub Issues) The previously-untracked "v0.9 candidates" list has been filed as GitHub Issues. Issue state is the source of truth; this section is a navigational gloss. - [#35](https://github.com/brandon-behring/eval-toolkit/issues/35) (P2) — `TokenizationLeakageCheck` (HF-tokenizer-aware dedup; complements `NormalizedFormLeakageCheck`). - [#31](https://github.com/brandon-behring/eval-toolkit/issues/31) (P3) — Migrate `docs/source/examples/` from static MD to executable myst-nb cells. - [#36](https://github.com/brandon-behring/eval-toolkit/issues/36) (P3) — Inline bootstrap CI on every metric (Inspect-AI / lm-eval scorecard pattern). - [#37](https://github.com/brandon-behring/eval-toolkit/issues/37) (P3) — Restore per-module coverage floors (`seeds.py` 70 % due to optional torch path). - [#38](https://github.com/brandon-behring/eval-toolkit/issues/38) (P3) — CI doctests for `paths.py` / `provenance.py` / `seeds.py` / `docs.py`. Run `gh issue list -R brandon-behring/eval-toolkit --label P2` or `--label P3` for live state. ## v1-prelude evidence core The next stabilization step is the generic evidence layer now used by consumer migrations: - Validation-fit operating points can be applied to mixed-class, all-positive, or all-negative target slices with threshold provenance. - `RunManifest` can carry optional source-role records and guardrails. - Generic claim gates can fail missing headline comparisons, inadequate slice sizes, scorer/leakage errors, missing source roles, and metric caps such as hard-negative FPR. These stay library-first: no prompt-injection datasets, presets, CLI, or markdown report generator. ## v1.0.0 path (long-term, gated) v1.0.0 signals API stability — breaking changes after v1.0 require v2.0. Gated on: 1. **Real consumer running v0.7+ in production for ≥ 1 review cycle.** The canonical consumer is `prompt-injection-detection-submission`. Other `prompt_injection_*` / `prompt-injection-*` repos in the author's workspace are experiments, scaffolds, or earlier prototypes — only the detection-submission repo gates v1.0. 2. **Protocol shapes survive ≥ 1 "should we change this?" review cycle.** v0.7.x added 5 Tier-2 Protocols (`Scorer`, `LeakageCheck`, `Splitter`, `ThresholdSelector`, `DatasetLoader`) + 1 opt-in (`Versioned`). As of v0.41.0, all six have been stable across 34 minor releases (v0.7 → v0.41) except for one contract-tightening edit to `LeakageCheck.name` in v0.39.0 (#40) — changing the Protocol declaration from a settable class-level attribute to a `@property` to align with the `@dataclass(frozen=True)` implementation pattern. v0.40.0 and v0.41.0 both shipped without Protocol shape edits (v0.40: `fit_platt_binary` + `fit_beta_binary` additions; v0.41: `HFDatasetsLoader` enrichment — neither touched Tier-2 Protocols). The stability window is now **2 of 2 minors without Protocol edits** as of v0.41.0 — Gate 2 ✅ **MET**. The v1-prelude evidence APIs must also survive one real-consumer migration check (the `prompt-injection-detection-submission` repo). 3. **Methodology docs peer-reviewed** by an external reader (statistics / methodology background, ideally not part of the `prompt_injection_*` core team). 4. **Croissant interop verified end-to-end** — ✅ **MET as of v0.41.0** (see `tests/test_croissant_e2e.py`). `HFDatasetsLoader.describe()` fetches Croissant metadata + per-file `sha256` from HF Hub; the integration test downloads a real parquet shard from `stanfordnlp/sst2` and verifies the bytes hash bit-exactly to the value `describe()` reports. **Caveat**: HF Hub's Croissant emitter currently punts `distribution[].sha256` (per the still-open MLCommons Croissant spec issue [#80](https://github.com/mlcommons/croissant/issues/80)), so `HFDatasetsLoader` reads sha256 from HF Hub's tree API (`lfs.oid`) today. When #80 resolves and HF Hub starts populating Croissant `sha256` with real values, the loader will pick up the new source automatically. See `methodology/reproducibility.md` §"Croissant interoperability" for the design. When v1.0 ships: - API surface freezes. Breaking changes require a v2.0 major bump. - The five Tier-2 Protocols become contracts (no method-shape changes, only additive subprotocols). - The JSON schemas (`schemas/*.v1.json`) become the canonical contract; any breaking change to a schema bumps to `*.v2.json`. ## Out of scope (deliberately) These are valuable but **not** on the roadmap: - **Native fairness metrics (demographic parity, equalized odds, calibration parity).** Consumer computes via [fairlearn](https://fairlearn.org/) + the toolkit's slicing primitives; eval-toolkit shouldn't duplicate. - **McNemar / DeLong tests.** Consumer computes via `scipy.stats`; the toolkit's bootstrap framework covers the same ground for arbitrary metrics. - **Common metrics (MCC, Cohen's kappa, balanced accuracy, log-loss).** Design intent keeps the metric set focused on the four headline primitives + ECE family + threshold selection. Consumers add what they need. - **CLI.** The toolkit is a library; consumer projects build their own CLI (e.g., the `prompt_injection_*` repos' `evaluate.py` scripts). - **A formal plugin registry / setuptools entry-points system.** The Protocol-based seam is sufficient. - **Optional `fit_platt_calibrator(canonical: bool = True)` flag.** v0.3.0 already canonicalized the impl per Platt 1999 §2.2; the flag would re-introduce the non-canonical variant for backward compatibility but no consumer demand has surfaced. WONTFIX unless asked. ## How to file an upstream wish 1. Add a section in your project's `docs/eval_toolkit_gaps.md` describing the gap, severity, and a sketch of the patch (if known). 2. Open an issue or PR against `eval-toolkit`'s GitHub linking that gaps doc. 3. If you've done the work locally, the PR can be a draft with the suggested patch + tests; we'll reconcile against this roadmap. ## See also - [`CHANGELOG.md`](https://github.com/brandon-behring/eval-toolkit/blob/main/CHANGELOG.md) — release history. - [`docs/MIGRATION.md`](MIGRATION.md) — version-to-version migration guides. - [`docs/methodology/reading_list.md`](methodology/reading_list.md) — citation-level "future work" pointers (statistical methods that could land if there's appetite).