Repo Strategy#

How eval-toolkit is organized today, the target shape, and the rules for when to extract a sub-package into its own repo.

If you’re an AI session entering this repo to propose reorganization, extraction, or a new module: read this entire doc before proposing changes.

The current target shape is locked in for v0.10.x: in-place mono-repo with the 6 sub-package layout in §4. Don’t propose alternative groupings. Don’t propose multi-repo splits unless an extraction passes the §5 checklist.

Before any extraction proposal, run the audit in §6. If the candidate scores fewer than 2 yes on the §5 4-question checklist, don’t propose extraction; instead propose closing the gap (e.g., add a deprecation, ask for a downstream signal).

Next scheduled audit: v0.13.0 (every 3 minor releases; current is v0.10.0).

What’s in this doc:

§1 Context — why this exists.
§2 Audit findings — the dependency-graph state as of v0.10.0.
§3 LOC + concern map — vocabulary the rest of the doc uses.
§4 Recommended target shape — the 6-bucket layout.
§5 Extraction checklist — the 4 questions that decide whether to spin off a sub-package.
§6 Audit cadence — how to re-run this audit.
§7 Anti-goals — paths explicitly ruled out.
§8 Open questions — deferred to a future round.
§9 Cross-links — related docs.

If you only read one section, read §5.

1. Context#

Two motivations drove this strategy doc:

Architectural debt. The README’s three-tier architecture (functional core / Protocol-based orchestration / reproducibility scaffolding) is partly aspirational. Three Tier-2 helpers (leakage, splits, loaders) import upward into harness, which means you can’t use a Splitter without taking the orchestrator along. Two modules (thresholds, operating_points) straddle the Tier 1 / Tier 2 boundary — they’re labeled as Protocols but their impls reach into the math kernels, so they behave like Tier 1 utilities. metrics has in-degree 4 (analysis, operating_points, thresholds, harness all import it), which makes “extract metrics” a refactor that ripples into half the package.

Conceptual sprawl. 24 modules, roughly 12.8K LOC (the §3 bucket totals sum to 12,761), six distinct concerns under one name: math kernels, eval orchestration, evidence/claims, reproducibility scaffolding, plotting, standalone utilities (text dedup, markdown rendering, paths, config). The “eval-toolkit” name fits the math + orchestration; it strains when applied to a 1300-LOC text-deduplication library or a markdown renderer.

The constraint: stay mono-repo by default. Multi-repo splits incur real coordination overhead (synchronized releases, version pin management, cross-repo PRs). Split only when a candidate proves independent users / release cadence / install-slimness payoff (the 4-question checklist in §5).

This is a roadmap document, not an implementation plan. No code moves in v0.10.x.

2. Audit findings#

Snapshot from a dependency-graph audit run at v0.10.0 (f6ef004).

Coupling map#

True leaves (zero in-package imports — trivially extractable if their consumers go too): _version, protocols, seeds, config, paths, evidence, artifacts, claims, docs, plotting, provenance, text_dedup, calibration, bootstrap, metrics. That’s 15 of 24 modules.

Hub. src/eval_toolkit/harness.py is the central orchestrator — in-degree 3 (splits, loaders, leakage import it), out-degree 5 (it imports from artifacts, bootstrap, metrics, operating_points, protocols, plus TYPE_CHECKING imports of leakage and splits). This is the package’s identity; extracting it would leave a confusingly-named shell.

High-leverage kernel. src/eval_toolkit/metrics.py has in-degree 4 (analysis, operating_points, thresholds, harness). Anything that touches metrics ripples broadly. This makes metrics foundational — and unextractable without taking its consumers along.
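As a quick cross-check of the degrees quoted above, the sketch below recomputes them from the runtime import edges this section names. The edge list is transcribed by hand from the prose (TYPE_CHECKING-only imports are left out), so it is an illustration of the bookkeeping, not the output of the audit tooling.

```python
from collections import Counter

# (importer, imported) pairs as named in the coupling map above.
EDGES = [
    ("splits", "harness"),
    ("loaders", "harness"),
    ("leakage", "harness"),
    ("harness", "artifacts"),
    ("harness", "bootstrap"),
    ("harness", "metrics"),
    ("harness", "operating_points"),
    ("harness", "protocols"),
    ("analysis", "metrics"),
    ("operating_points", "metrics"),
    ("thresholds", "metrics"),
]


def degrees(edges):
    """Return (in_degree, out_degree) counters over de-duplicated edges."""
    unique = set(edges)
    in_degree = Counter(imported for _, imported in unique)
    out_degree = Counter(importer for importer, _ in unique)
    return in_degree, out_degree


in_degree, out_degree = degrees(EDGES)
print("harness:", in_degree["harness"], "in,", out_degree["harness"], "out")
print("metrics:", in_degree["metrics"], "in")
```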

Tier 1/2 blurring#

The README claims thresholds and operating_points are Tier 2 (Protocol-based orchestration). In practice:

src/eval_toolkit/thresholds.py defines the ThresholdSelector Protocol and 6 reference impls. The impls import metrics and calibration (both Tier 1). A selector uses a metric — that’s fine — but it means thresholds is half-Tier-1 in coupling terms.
src/eval_toolkit/operating_points.py (178 LOC) imports metrics + thresholds. It’s labeled Tier 2 but acts like a Tier 1 utility for fit-then-apply threshold workflows.

This blurring is annotated in the doc, but the fix (relabeling + moving operating_points under the math kernels) is deferred to the v0.11.0 reorg in §4.

Upward-import anti-pattern#

Three Tier-2 helpers reach into the orchestrator for type shapes:

src/eval_toolkit/leakage.py imports harness (for EvalSlice).
src/eval_toolkit/splits.py imports harness (same).
src/eval_toolkit/loaders.py imports harness + provenance.

These helpers should depend on a type contract, not the orchestrator. The fix in v0.11.0 is to extract a small eval_toolkit/_types.py module with the EvalSlice and RunResult dataclasses, and have helpers + harness both depend on _types. Pure internal refactor; no public API change. Eliminates the cycle that blocks “I want a Splitter without the harness” use cases.
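A minimal single-file sketch of the dependency shape described above. The real fields of EvalSlice and RunResult are not listed in this doc, so every field name here is a hypothetical placeholder, as are the helper and orchestrator functions; only the direction of the dependencies is the point.

```python
from dataclasses import dataclass, field
from typing import Sequence


# --- would live in eval_toolkit/_types.py (fields are hypothetical) ---
@dataclass(frozen=True)
class EvalSlice:
    name: str
    indices: Sequence[int]


@dataclass(frozen=True)
class RunResult:
    slice_name: str
    metrics: dict = field(default_factory=dict)


# --- a helper (think splits.py): depends only on the type contract ---
def split_even_odd(n_rows: int) -> list:
    """Hypothetical splitter: alternate rows between two slices."""
    even = EvalSlice("even", tuple(range(0, n_rows, 2)))
    odd = EvalSlice("odd", tuple(range(1, n_rows, 2)))
    return [even, odd]


# --- the orchestrator (think harness): also depends only on _types ---
def evaluate(eval_slice: EvalSlice, labels: Sequence[int]) -> RunResult:
    """Hypothetical evaluation: positive rate over the slice's rows."""
    picked = [labels[i] for i in eval_slice.indices]
    return RunResult(eval_slice.name, {"positive_rate": sum(picked) / len(picked)})


slices = split_even_odd(6)
results = [evaluate(s, [1, 0, 1, 1, 0, 0]) for s in slices]
```

Neither the splitter nor the orchestrator references the other; both import the two dataclasses, which is what lets a consumer take a Splitter without the harness.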

3. LOC + concern map#

Concern bucket	Modules	LOC
Math kernels (Tier 1)	metrics, bootstrap, calibration, thresholds, operating_points	5389
Eval orchestration (Tier 2)	harness, splits, loaders, leakage, analysis	2698
Evidence layer (v0.9)	claims, artifacts, evidence	948
Reproducibility (Tier 3)	manifest, provenance, seeds	797
Visualization	plotting	956
Standalone utilities	text_dedup, docs, paths, config, protocols, _version	1973

Three modules — metrics (1550), bootstrap (1149), text_dedup (1327) — are 32% of the LOC.
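The arithmetic behind that share can be reproduced from the table. This is only a worked check using the numbers printed above; it does not read the source tree.

```python
# LOC per concern bucket, copied from the table above.
BUCKET_LOC = {
    "math kernels": 5389,
    "eval orchestration": 2698,
    "evidence layer": 948,
    "reproducibility": 797,
    "visualization": 956,
    "standalone utilities": 1973,
}

# The three largest modules called out in the text.
LARGEST = {"metrics": 1550, "bootstrap": 1149, "text_dedup": 1327}

total_loc = sum(BUCKET_LOC.values())
largest_loc = sum(LARGEST.values())
share = largest_loc / total_loc

print(total_loc)        # 12761
print(largest_loc)      # 4026
print(f"{share:.1%}")   # 31.5%
```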

4. Recommended target shape#

Phase 1 — in-place mono-repo reorganization (target: v0.11.0)#

Restructure src/eval_toolkit/ into 6 sub-packages, one per bucket:

src/eval_toolkit/
├── __init__.py            # lazy re-exports preserved (back-compat)
├── _version.py
├── _types.py              # NEW: EvalSlice, RunResult (extracted from harness)
├── core/                  # math kernels — Tier 1
│   ├── metrics.py
│   ├── bootstrap.py
│   ├── calibration.py
│   ├── thresholds.py
│   └── operating_points.py
├── harness/               # eval orchestration — Tier 2
│   ├── __init__.py        # re-exports evaluate, evaluate_folded, EvalSlice
│   ├── orchestrator.py    # was harness.py
│   ├── splits.py
│   ├── loaders.py
│   ├── leakage.py
│   └── analysis.py
├── evidence/              # claims / artifacts (v0.9)
│   ├── claims.py
│   ├── artifacts.py
│   └── evidence.py
├── scaffolding/           # reproducibility — Tier 3
│   ├── manifest.py
│   ├── provenance.py
│   └── seeds.py
├── viz/                   # plotting (matplotlib-only)
│   └── plotting.py
└── utils/                 # standalone tools
    ├── text_dedup.py
    ├── docs.py            # markdown rendering
    ├── paths.py
    ├── config.py
    └── protocols.py       # the trait crate

Top-level eval_toolkit/__init__.py keeps every existing lazy export working — consumers see no breaking change. Internally:

Helpers in harness/ depend on _types instead of importing harness/orchestrator. Cycle eliminated.
A CI lint test (tests/test_import_directions.py) parses src/eval_toolkit/**/*.py ASTs and fails the build if a sub-package imports from a “higher” sub-package per a declared import order: _types < core < (harness, evidence, scaffolding, viz, utils).
The protocols module moves to utils/ — it’s a trait crate with zero runtime cost; placing it under utils makes the conceptual map cleaner (it’s a contract used by orchestration but not part of it).
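The import-direction check could look roughly like the sketch below. It is not the contents of tests/test_import_directions.py; it assumes absolute imports only, treats the five peer sub-packages as one unordered rank, and takes source text as a string so the example runs without a checkout. A real test would walk the files under src/eval_toolkit/ and derive the owning sub-package from each path.

```python
import ast

PACKAGE = "eval_toolkit"

# Declared order: _types < core < (harness, evidence, scaffolding, viz, utils).
RANK = {
    "_types": 0,
    "core": 1,
    "harness": 2,
    "evidence": 2,
    "scaffolding": 2,
    "viz": 2,
    "utils": 2,
}


def imported_subpackages(source: str) -> set:
    """Collect X for every absolute import of eval_toolkit.X[...] in source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom):
            # Relative imports (level > 0) are skipped in this sketch.
            names = [node.module] if node.level == 0 and node.module else []
        elif isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        else:
            continue
        for name in names:
            parts = name.split(".")
            if parts[0] == PACKAGE and len(parts) > 1:
                found.add(parts[1])
    return found


def violations(owner: str, source: str) -> list:
    """Sub-packages imported by `owner` that rank above it."""
    own_rank = RANK[owner]
    return sorted(
        target
        for target in imported_subpackages(source)
        if RANK.get(target, -1) > own_rank
    )


core_source = (
    "from eval_toolkit.harness.orchestrator import evaluate\n"
    "from eval_toolkit._types import EvalSlice\n"
)
harness_source = (
    "from eval_toolkit._types import EvalSlice\n"
    "import eval_toolkit.core.metrics\n"
)

print(violations("core", core_source))        # ['harness']
print(violations("harness", harness_source))  # []
```

Note that ast.walk also visits imports nested under `if TYPE_CHECKING:`; whether those should count as violations is a policy choice this sketch does not settle.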

Phase 2 — earn-then-extract (long-term, no committed schedule)#

Three candidates, ranked by independence + value:

eval-toolkit-text-dedup — separate PyPI package + repo. 1327 LOC, sklearn-heavy, zero in-package imports, conceptually orthogonal (it’s a similarity/dedup library, not an eval kernel). The current eval_toolkit.utils.text_dedup becomes a thin re-export from the new package + a DeprecationWarning. Trigger: a downstream consumer asks for it standalone, or the sklearn-version pin starts costing the main package.
eval-toolkit-plotting — separate PyPI package + repo. 956 LOC, matplotlib-only (already gated behind [plotting] extra), zero in-package imports, conceptually orthogonal. The split lets matplotlib churn (3.x bumps, deprecations) without touching the math core. Trigger: a matplotlib release breaks something and we want to ship a viz-only fix without bumping the main package.
eval-toolkit-evidence — separate package, only if v0.10+ evidence-layer adoption proves it has independent consumers. 948 LOC, recently added in v0.9, churning. The gate framework (claims.py) is generic enough to be useful outside eval-toolkit. But: deeply integrated with the RunResult shape, so extraction means freezing that contract first. Trigger: someone wants to use the gate framework against a non-eval-toolkit result payload.
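The “thin re-export + DeprecationWarning” shim mentioned for text_dedup might be sketched as follows. The function find_duplicates and the import name eval_toolkit_text_dedup are hypothetical stand-ins (this doc names only the PyPI package), and the wrapper approach is one option among several.

```python
import warnings
from functools import wraps


def deprecated_reexport(func, old_path: str, new_path: str):
    """Wrap func so calls through the old import path emit a DeprecationWarning."""

    @wraps(func)
    def wrapper(*args, **kwargs):
        warnings.warn(
            f"{old_path} is deprecated; import {new_path} instead.",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not the shim
        )
        return func(*args, **kwargs)

    return wrapper


# Hypothetical stand-in for a function owned by the extracted package.
def find_duplicates(texts):
    seen, dupes = set(), []
    for text in texts:
        key = text.strip().lower()
        if key in seen:
            dupes.append(text)
        seen.add(key)
    return dupes


# What the old module location would expose after extraction.
shim = deprecated_reexport(
    find_duplicates,
    "eval_toolkit.utils.text_dedup.find_duplicates",
    "eval_toolkit_text_dedup.find_duplicates",  # hypothetical import name
)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = shim(["a", "A ", "b"])

print(result)                       # ['A ']
print(caught[0].category.__name__)  # DeprecationWarning
```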

What we’re NOT extracting#

metrics, bootstrap, calibration: too foundational. In-degree 4 on metrics means any extraction would drag its four consumers (analysis, operating_points, thresholds, harness) along. These stay the math core of eval-toolkit.
harness + the orchestration ring: this is the package’s identity. Extracting it would leave a confusingly-named shell.
manifest + provenance + seeds: small, tight Tier-3 trio with no consumer demand to use them outside an eval context.

5. Extraction checklist#

For any module X proposed for extraction into a separate PyPI package, run all four questions. Extract only if ≥ 2 of 4 are yes. Default = stay together.

Each question has a concrete command so the audit is reproducible by a human or AI session.

Q1 — Has X been touched in the last 2 minor releases for reasons unrelated to the rest of the package?#

# Replace v0.<n-2> with the version 2 minors back from current
git log --oneline v0.<n-2>..HEAD -- src/eval_toolkit/<X>.py

If every listed commit touches only X (no co-modifications of harness, metrics, etc.; git show --stat <sha> lists the files each commit touched), mark yes.

Q2 — Has at least one downstream consumer asked to use X without pulling the rest of eval-toolkit?#

# Search known consumer repos
gh search code "eval_toolkit.<X>" --owner <user>
# Or grep manually if consumer repos are private

Known consumer repos: prompt-injection-detector, prompt-injection-showcase, prompt-injection-sdd. If any has filed an issue, asked in chat, or imported just <X> without the harness, mark yes.

Q3 — Does X have heavy / churning external deps whose churn would otherwise force main-package bumps?#

# Check release cadence of X's primary external dep
pip index versions <X-key-dep>

If the dep released ≥ 2 minor versions in the last 6 months and we’ve seen breakage in the main package because of it, mark yes.

Q4 — Is X’s public API stable enough to commit to a separate release cadence?#

git log --oneline -- src/eval_toolkit/<X>.py | head -20
grep -A 50 "^__all__" src/eval_toolkit/<X>.py

If __all__ hasn’t changed in the last 2 minor releases, mark yes. (A churning API can’t be split off because every change forces a new release of the spun-off package.)
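The grep above shows the current __all__ but not whether it changed. One way to compare two versions is sketched below; it assumes __all__ is a plain literal assignment at module top level (augmented or annotated assignments are not handled), and it takes source text as input. The old text could come from, for example, git show v0.<n-2>:src/eval_toolkit/<X>.py.

```python
import ast


def exported_names(source: str) -> set:
    """Return the names in a top-level `__all__ = [...]` literal, or an empty set."""
    for node in ast.parse(source).body:
        if isinstance(node, ast.Assign):
            targets = [t.id for t in node.targets if isinstance(t, ast.Name)]
            if "__all__" in targets:
                return set(ast.literal_eval(node.value))
    return set()


def all_diff(old_source: str, new_source: str):
    """Return (added, removed) public names between two versions of a module."""
    old, new = exported_names(old_source), exported_names(new_source)
    return sorted(new - old), sorted(old - new)


old_source = '__all__ = ["fit", "apply"]\n'
new_source = '__all__ = ["fit", "apply_batch"]\n'

added, removed = all_diff(old_source, new_source)
print(added)    # ['apply_batch']
print(removed)  # ['apply']
```

Both lists empty across the last 2 minor releases corresponds to a yes on Q4.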

Verdict#

0–1 yes — do not extract. Keep X in the main package.
2 yes — extraction is justified. Open a planning round with this doc as the entry point.
3–4 yes — extraction is overdue. Prioritize.

Document the verdict in this doc as an appendix, even if the verdict is “do not extract.” A no-change outcome is a valid audit result, but it still gets written down: an explicit “ran the audit, nothing changed” entry prevents future sessions from re-deriving the same conclusion.
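The scoring rule is simple enough to state as code. The function name and the Q1..Q4 keys below are illustrative only; nothing like this is claimed to exist in the repo.

```python
def extraction_verdict(answers: dict) -> str:
    """Map the four yes/no answers from the checklist to a verdict string."""
    expected = {"Q1", "Q2", "Q3", "Q4"}
    if set(answers) != expected:
        raise ValueError(f"expected answers for {sorted(expected)}")
    yes_count = sum(bool(value) for value in answers.values())
    if yes_count <= 1:
        return "do not extract"
    if yes_count == 2:
        return "extraction justified"
    return "extraction overdue"


# Example: independent churn (Q1) and a stable API (Q4), but no downstream
# ask (Q2) and no dependency pain (Q3).
print(extraction_verdict({"Q1": True, "Q2": False, "Q3": False, "Q4": True}))
# extraction justified
```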

6. Audit cadence#

Cadence: every 3 minor releases. Next audit due at v0.13.0 (current is v0.10.0).

How to run the full audit#

A human or AI session can copy-paste these commands:

# 1. Refresh the dependency graph
grep -rnE "^from eval_toolkit\.|^import eval_toolkit\." src/eval_toolkit/ \
  | grep -v "TYPE_CHECKING" \
  | sort | uniq

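# 2. Per-module LOC (re-confirms the LOC + concern map in §3)
#    In bash, ** only recurses after `shopt -s globstar` (bash 4+); zsh recurses by default.
shopt -s globstar 2>/dev/null
wc -l src/eval_toolkit/**/*.py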

# 3. Per-module __all__ size
for f in src/eval_toolkit/**/*.py; do
  printf "%-40s " "$f"
  grep -A 50 "^__all__" "$f" | grep -c '"[A-Za-z_]'
done

# 4. Coverage per module (re-confirms the v0.10.0 ≥90% per-module floor)
uv run pytest -q --cov=eval_toolkit --cov-report=term --no-header

Then for each candidate module, run the §5 checklist commands.

Audit owner#

Whoever opens a PR on or after the v0.13.0 milestone. The output goes into:

A new appendix to this doc (§Audit log — added in the same PR that triggers the audit), OR
A comment on a tracking issue, with a link from this doc.

Either way, the audit produces a written artifact. If no extraction passes the §5 checklist, write “audit ran at v0.13.0, nothing changed” and move on. A no-change outcome is a valid result.

7. Anti-goals#

Things this strategy explicitly rules out:

Don’t split into 6 separate repos at once. Multi-repo coordination overhead (synchronized releases, version pin management, cross-repo PRs) is real; the value has to be earned per-extraction via the §5 checklist.
Don’t propose alternative bucket groupings. The 6-bucket cut in §4 is the agreed shape. Re-litigating it costs more than it buys. If a new module genuinely doesn’t fit any bucket, that’s a signal to write up a §8 open question, not to redraw the map.
Don’t extract metrics, bootstrap, or calibration as separate packages. Foundational math kernels with multiple in-package consumers; circular installs (eval-toolkit-core depending on eval-toolkit-metrics etc.) are worse than co-location.
Don’t add a plugin / entry-point system. The Protocol layer in eval_toolkit/utils/protocols.py already gives consumers Tier 2 extension; entry-points are a lot of machinery for the same outcome.
Don’t rename the package. eval-toolkit stays as the PyPI name. Sub-packages get sensible names within (eval_toolkit.viz, eval_toolkit.core, etc.).
Don’t pre-emptively add deprecation warnings to today’s API. No DeprecationWarning lands until a Phase 2 extraction actually triggers and the back-compat shim is the only thing left in the main package.

Parallelism (v0.34.0+)#

The toolkit was historically single-threaded — not by policy, just because no primitive needed parallelism. v0.34.0 codifies the explicit pattern: opt-in per-function n_jobs parameter backed by the internal _parallel.parallel_map helper (joblib loky backend; reproducibility-by-default via np.random.SeedSequence).

What this means for contributors:

Adding n_jobs to a new function is allowed and encouraged when the function has a Python-level loop over independent work units with medium+ per-item cost. Follow the checklist in methodology/parallelism.md §“When to add n_jobs”.
Use the helper, don’t roll your own. All parallelism flows through _parallel.parallel_map; no inline joblib.Parallel, no concurrent.futures, no raw multiprocessing, no asyncio.
Default sequential. n_jobs: int = 1 keeps existing call sites unchanged and preserves reproducibility/traceback fidelity by default.

8. Open questions#

Deferred to the v0.13.0 audit (or earlier if a trigger fires):

Should analysis.py (CSV/JSONL prediction readers) move to evidence/? It’s used by the evidence-core paired-diff workflow more than by the eval orchestrator. Currently lives in harness/ per §4 because it imports bootstrap + metrics, but the consumer shape suggests it belongs with claims + artifacts.
Should protocols.py become its own published package (eval-toolkit-protocols) so consumer libraries can declare a dependency on the contract without depending on any implementation? Useful only if there’s ever a second harness implementation. Today there isn’t.
v1.0 timing — when does the API freeze trigger? The §5 checklist’s Q4 (__all__ stability) is the relevant signal.

9. Cross-links#

roadmap.md — broader feature roadmap. This doc is the organizational roadmap; that’s the feature roadmap.
methodology/versioning.md — the schema-evolution policy. Any extracted package’s API contract should follow the same additive-fields, additionalProperties: true discipline documented there.
MIGRATION.md — index of per-version migration guides. The v0.11.0 reorg gets its own entry there; future Phase-2 extractions get their own entries too.
extending.md — the Protocol-by-Protocol guide for custom Scorers, Splitters, gates. Stays the recommended extension point regardless of how the package is internally organized.