# Repo Strategy How `eval-toolkit` is organized today, the target shape, and the rules for when to extract a sub-package into its own repo. > **If you're an AI session entering this repo to propose > reorganization, extraction, or a new module: read this entire doc > before proposing changes.** > > **The current target shape is locked-in for v0.10.x:** in-place > mono-repo with the 6 sub-package layout in [§4](#target-shape). > Don't propose alternative groupings. Don't propose multi-repo > splits unless an extraction passes the [§5 checklist](#extraction-checklist). > > **Before any extraction proposal, run the audit in [§6](#audit-cadence).** > If the audit fails the 4-question checklist (≥ 2 yes), don't > propose extraction; instead propose closing the gap (e.g., add a > deprecation, ask for a downstream signal). > > **Next scheduled audit: v0.13.0** (every 3 minor releases; current > is v0.10.0). What's in this doc: - §1 [Context](#context) — why this exists. - §2 [Audit findings](#audit-findings) — the dependency-graph state as of v0.10.0. - §3 [LOC + concern map](#loc-concern-map) — vocabulary the rest of the doc uses. - §4 [Recommended target shape](#target-shape) — the 6-bucket layout. - §5 [Extraction checklist](#extraction-checklist) — the 4 questions that decide whether to spin off a sub-package. - §6 [Audit cadence](#audit-cadence) — how to re-run this audit. - §7 [Anti-goals](#anti-goals) — paths explicitly ruled out. - §8 [Open questions](#open-questions) — deferred to a future round. - §9 [Cross-links](#cross-links) — related docs. If you only read one section, read [§5](#extraction-checklist). --- (context)= ## 1. Context Two motivations drove this strategy doc: **Architectural debt.** The README's three-tier architecture (functional core / Protocol-based orchestration / reproducibility scaffolding) is partly aspirational. Three Tier-2 helpers (`leakage`, `splits`, `loaders`) import *upward* into `harness`, which means you can't use a Splitter without taking the orchestrator along. Two modules (`thresholds`, `operating_points`) straddle the Tier 1 / Tier 2 boundary — they're labeled as Protocols but their impls reach into the math kernels, so they behave like Tier 1 utilities. `metrics` has in-degree 4 (analysis, operating_points, thresholds, harness all import it), which makes "extract metrics" a refactor that ripples into half the package. **Conceptual sprawl.** 24 modules, 12.3K LOC, six distinct concerns under one name: math kernels, eval orchestration, evidence/claims, reproducibility scaffolding, plotting, standalone utilities (text dedup, markdown rendering, paths, config). The "eval-toolkit" name fits the math + orchestration; it strains when applied to a 1300-LOC text-deduplication library or a markdown renderer. The constraint: stay mono-repo by default. Multi-repo splits incur real coordination overhead (synchronized releases, version pin management, cross-repo PRs). Split only when a candidate proves independent users / release cadence / install-slimness payoff (the 4-question checklist in §5). This is a **roadmap document**, not an implementation plan. No code moves in v0.10.x. (audit-findings)= ## 2. Audit findings Snapshot from a dependency-graph audit run at v0.10.0 (`f6ef004`). ### Coupling map **True leaves** (zero in-package imports — trivially extractable *if* their consumers go too): `_version`, `protocols`, `seeds`, `config`, `paths`, `evidence`, `artifacts`, `claims`, `docs`, `plotting`, `provenance`, `text_dedup`, `calibration`, `bootstrap`, `metrics`. That's 15 of 24 modules. **Hub.** `src/eval_toolkit/harness.py` is the central orchestrator — in-degree 3 (`splits`, `loaders`, `leakage` import it), out-degree 5 (it imports from `artifacts`, `bootstrap`, `metrics`, `operating_points`, `protocols`, plus TYPE_CHECKING imports of `leakage` and `splits`). This is the package's identity; extracting it would leave a confusingly-named shell. **High-leverage kernel.** `src/eval_toolkit/metrics.py` has in-degree 4 (`analysis`, `operating_points`, `thresholds`, `harness`). Anything that touches metrics ripples broadly. This makes metrics foundational — and unextractable without taking its consumers along. ### Tier 1/2 blurring The README claims `thresholds` and `operating_points` are Tier 2 (Protocol-based orchestration). In practice: - `src/eval_toolkit/thresholds.py` defines the `ThresholdSelector` Protocol and 6 reference impls. The impls import `metrics` and `calibration` (both Tier 1). A selector *uses* a metric — that's fine — but it means `thresholds` is half-Tier-1 in coupling terms. - `src/eval_toolkit/operating_points.py` (178 LOC) imports `metrics` + `thresholds`. It's labeled Tier 2 but acts like a Tier 1 utility for fit-then-apply threshold workflows. This blurring is annotated in the doc, but the *fix* (relabeling + moving `operating_points` under the math kernels) is deferred to the v0.11.0 reorg in §4. ### Upward-import anti-pattern Three Tier-2 helpers reach into the orchestrator for type shapes: - `src/eval_toolkit/leakage.py` imports `harness` (for `EvalSlice`). - `src/eval_toolkit/splits.py` imports `harness` (same). - `src/eval_toolkit/loaders.py` imports `harness` + `provenance`. These helpers should depend on a *type contract*, not the orchestrator. The fix in v0.11.0 is to extract a small `eval_toolkit/_types.py` module with the `EvalSlice` and `RunResult` dataclasses, and have helpers + harness both depend on `_types`. Pure internal refactor; no public API change. Eliminates the cycle that blocks "I want a Splitter without the harness" use cases. (loc-concern-map)= ## 3. LOC + concern map | Concern bucket | Modules | LOC | |---|---|---:| | **Math kernels** (Tier 1) | metrics, bootstrap, calibration, thresholds, operating_points | 5389 | | **Eval orchestration** (Tier 2) | harness, splits, loaders, leakage, analysis | 2698 | | **Evidence layer** (v0.9) | claims, artifacts, evidence | 948 | | **Reproducibility** (Tier 3) | manifest, provenance, seeds | 797 | | **Visualization** | plotting | 956 | | **Standalone utilities** | text_dedup, docs, paths, config, protocols, _version | 1973 | Three modules — `metrics` (1550), `bootstrap` (1149), `text_dedup` (1327) — are 32% of the LOC. (target-shape)= ## 4. Recommended target shape ### Phase 1 — in-place mono-repo reorganization (target: v0.11.0) Restructure `src/eval_toolkit/` into 6 sub-packages, one per bucket: ``` src/eval_toolkit/ ├── __init__.py # lazy re-exports preserved (back-compat) ├── _version.py ├── _types.py # NEW: EvalSlice, RunResult (extracted from harness) ├── core/ # math kernels — Tier 1 │ ├── metrics.py │ ├── bootstrap.py │ ├── calibration.py │ ├── thresholds.py │ └── operating_points.py ├── harness/ # eval orchestration — Tier 2 │ ├── __init__.py # re-exports evaluate, evaluate_folded, EvalSlice │ ├── orchestrator.py # was harness.py │ ├── splits.py │ ├── loaders.py │ ├── leakage.py │ └── analysis.py ├── evidence/ # claims / artifacts (v0.9) │ ├── claims.py │ ├── artifacts.py │ └── evidence.py ├── scaffolding/ # reproducibility — Tier 3 │ ├── manifest.py │ ├── provenance.py │ └── seeds.py ├── viz/ # plotting (matplotlib-only) │ └── plotting.py └── utils/ # standalone tools ├── text_dedup.py ├── docs.py # markdown rendering ├── paths.py ├── config.py └── protocols.py # the trait crate ``` Top-level `eval_toolkit/__init__.py` keeps every existing lazy export working — consumers see no breaking change. Internally: - Helpers in `harness/` depend on `_types` instead of importing `harness/orchestrator`. Cycle eliminated. - A CI lint test (`tests/test_import_directions.py`) parses `src/eval_toolkit/**/*.py` ASTs and fails the build if a sub-package imports from a "higher" sub-package per a declared import order: `_types < core < (harness, evidence, scaffolding, viz, utils)`. - The `protocols` module moves to `utils/` — it's a trait crate with zero runtime cost; placing it under utils makes the conceptual map cleaner (it's a *contract* used by orchestration but not part of it). ### Phase 2 — earn-then-extract (long-term, no committed schedule) Three candidates, ranked by independence + value: 1. **`eval-toolkit-text-dedup`** — separate PyPI package + repo. 1327 LOC, sklearn-heavy, zero in-package imports, conceptually orthogonal (it's a similarity/dedup library, not an eval kernel). The current `eval_toolkit.utils.text_dedup` becomes a thin re-export from the new package + a `DeprecationWarning`. **Trigger:** a downstream consumer asks for it standalone, or the sklearn-version pin starts costing the main package. 2. **`eval-toolkit-plotting`** — separate PyPI package + repo. 956 LOC, matplotlib-only (already gated behind `[plotting]` extra), zero in-package imports, conceptually orthogonal. The split lets matplotlib churn (3.x bumps, deprecations) without touching the math core. **Trigger:** a matplotlib release breaks something and we want to ship a viz-only fix without bumping the main package. 3. **`eval-toolkit-evidence`** — separate package, only if v0.10+ evidence-layer adoption proves it has independent consumers. 948 LOC, recently added in v0.9, churning. The gate framework (`claims.py`) is generic enough to be useful outside eval-toolkit. But: deeply integrated with the `RunResult` shape, so extraction means freezing that contract first. **Trigger:** someone wants to use the gate framework against a non-eval-toolkit result payload. ### What we're NOT extracting - `metrics`, `bootstrap`, `calibration` — too foundational; in-degree 4 on `metrics` means everything else falls in line behind it. These stay the math core of `eval-toolkit`. - `harness` + the orchestration ring — this *is* the package's identity. Extracting it would leave a confusingly-named shell. - `manifest` + `provenance` + `seeds` — small, tight Tier-3 pair, no consumer demand to use them outside an eval context. (extraction-checklist)= ## 5. Extraction checklist For any module `X` proposed for extraction into a separate PyPI package, run all four questions. **Extract only if ≥ 2 of 4 are yes.** Default = stay together. Each question has a concrete command so the audit is reproducible by a human or AI session. ### Q1 — Has X been touched in the last 2 minor releases for reasons unrelated to the rest of the package? ```bash # Replace v0. with the version 2 minors back from current git log --oneline v0...HEAD -- src/eval_toolkit/.py ``` If all listed commits' subject lines are X-only (no co-modifications of `harness`, `metrics`, etc.), mark **yes**. ### Q2 — Has at least one downstream consumer asked to use X without pulling the rest of eval-toolkit? ```bash # Search known consumer repos gh search code "eval_toolkit." --owner # Or grep manually if consumer repos are private ``` Known consumer repos: `prompt-injection-detector`, `prompt-injection-showcase`, `prompt-injection-sdd`. If any has filed an issue, asked in chat, or imported just `` without the harness, mark **yes**. ### Q3 — Does X have heavy / churning external deps whose churn would otherwise force main-package bumps? ```bash # Check release cadence of X's primary external dep pip index versions ``` If the dep released ≥ 2 minor versions in the last 6 months *and* we've seen breakage in the main package because of it, mark **yes**. ### Q4 — Is X's public API stable enough to commit to a separate release cadence? ```bash git log --oneline -- src/eval_toolkit/.py | head -20 grep -A 50 "^__all__" src/eval_toolkit/.py ``` If `__all__` hasn't changed in the last 2 minor releases, mark **yes**. (A churning API can't be split off because every change forces a new release of the spun-off package.) ### Verdict - **0–1 yes** — do not extract. Keep X in the main package. - **2 yes** — extraction is justified. Open a planning round with this doc as the entry point. - **3–4 yes** — extraction is overdue. Prioritize. Document the verdict in this doc as an appendix, even if the verdict is "do not extract." Silence is a valid audit result; an explicit "ran the audit, nothing changed" entry prevents future sessions from re-deriving the same conclusion. (audit-cadence)= ## 6. Audit cadence **Cadence:** every 3 minor releases. Next audit due at **v0.13.0** (current is v0.10.0). ### How to run the full audit These commands a human or AI session can copy-paste: ```bash # 1. Refresh the dependency graph grep -rnE "^from eval_toolkit\.|^import eval_toolkit\." src/eval_toolkit/ \ | grep -v "TYPE_CHECKING" \ | sort | uniq # 2. Per-module LOC (re-confirms the LOC + concern map in §3) wc -l src/eval_toolkit/**/*.py # 3. Per-module __all__ size for f in src/eval_toolkit/**/*.py; do printf "%-40s " "$f" grep -A 50 "^__all__" "$f" | grep -c '"[A-Za-z_]' done # 4. Coverage per module (re-confirms the v0.10.0 ≥90% per-module floor) uv run pytest -q --cov=eval_toolkit --cov-report=term --no-header ``` Then for each candidate module, run the §5 checklist commands. ### Audit owner Whoever opens a PR on or after the v0.13.0 milestone. The output goes into: - A new appendix to this doc (`§Audit log` — added in the same PR that triggers the audit), OR - A comment on a tracking issue, with a link from this doc. Either way, the audit produces a written artifact. **If no extraction passes the §5 checklist, write "audit ran at v0.13.0, nothing changed" and move on.** Silence is a valid result. (anti-goals)= ## 7. Anti-goals Things this strategy explicitly rules out: - **Don't split into 6 separate repos at once.** Multi-repo coordination overhead (synchronized releases, version pin management, cross-repo PRs) is real; the value has to be earned per-extraction via the §5 checklist. - **Don't propose alternative bucket groupings.** The 6-bucket cut in §4 is the agreed shape. Re-litigating it costs more than it buys. If a new module genuinely doesn't fit any bucket, that's a signal to write up a §8 open question, not to redraw the map. - **Don't extract `metrics`, `bootstrap`, or `calibration` as separate packages.** Foundational math kernels with multiple in-package consumers; circular installs (`eval-toolkit-core` depending on `eval-toolkit-metrics` etc.) are worse than co-location. - **Don't add a plugin / entry-point system.** The Protocol layer in `eval_toolkit/utils/protocols.py` already gives consumers Tier 2 extension; entry-points are a lot of machinery for the same outcome. - **Don't rename the package.** `eval-toolkit` stays as the PyPI name. Sub-packages get sensible names within (`eval_toolkit.viz`, `eval_toolkit.core`, etc.). - **Don't pre-emptively add deprecation warnings to today's API.** No `DeprecationWarning` lands until a Phase 2 extraction actually triggers and the back-compat shim is the only thing left in the main package. ### Parallelism (v0.34.0+) The toolkit was historically single-threaded — not by policy, just because no primitive needed parallelism. v0.34.0 codifies the explicit pattern: **opt-in per-function `n_jobs` parameter** backed by the internal [`_parallel.parallel_map`](methodology/parallelism.md) helper (joblib loky backend; reproducibility-by-default via `np.random.SeedSequence`). What this means for contributors: - **Adding `n_jobs` to a new function is allowed and encouraged** when the function has a Python-level loop over independent work units with medium+ per-item cost. Follow the checklist in [methodology/parallelism.md §"When to add `n_jobs`"](methodology/parallelism.md). - **Use the helper, don't roll your own.** All parallelism flows through `_parallel.parallel_map`; no inline `joblib.Parallel`, no `concurrent.futures`, no raw `multiprocessing`, no `asyncio`. - **Default sequential.** `n_jobs: int = 1` keeps existing call sites unchanged and preserves reproducibility/traceback fidelity by default. (open-questions)= ## 8. Open questions Deferred to the v0.13.0 audit (or earlier if a trigger fires): - Should `analysis.py` (CSV/JSONL prediction readers) move to `evidence/`? It's used by the evidence-core paired-diff workflow more than by the eval orchestrator. Currently lives in `harness/` per §4 because it imports `bootstrap` + `metrics`, but the *consumer* shape suggests it belongs with `claims` + `artifacts`. - Should `protocols.py` become its own published package (`eval-toolkit-protocols`) so consumer libraries can declare a dependency on the contract without depending on any implementation? Useful only if there's ever a second harness implementation. Today there isn't. - v1.0 timing — when does the API freeze trigger? The §5 checklist's Q4 (`__all__` stability) is the relevant signal. (cross-links)= ## 9. Cross-links - [`roadmap.md`](roadmap.md) — broader feature roadmap. This doc is the *organizational* roadmap; that's the *feature* roadmap. - [`methodology/versioning.md`](methodology/versioning.md) — the schema-evolution policy. Any extracted package's API contract should follow the same additive-fields, `additionalProperties: true` discipline documented there. - [`MIGRATION.md`](MIGRATION.md) — index of per-version migration guides. The v0.11.0 reorg gets its own entry there; future Phase-2 extractions get their own entries too. - [`extending.md`](extending.md) — the Protocol-by-Protocol guide for custom Scorers, Splitters, gates. Stays the recommended extension point regardless of how the package is internally organized.