Repo Strategy

How eval-toolkit is organized today, the target shape, and the rules for when to extract a sub-package into its own repo.

If you’re an AI session entering this repo to propose reorganization, extraction, or a new module: read this entire doc before proposing changes.

The current target shape is locked in for v0.10.x: an in-place mono-repo with the 6-sub-package layout in §4. Don’t propose alternative groupings. Don’t propose multi-repo splits unless an extraction passes the §5 checklist.

Before any extraction proposal, run the audit in §6. If the candidate scores fewer than 2 yes answers on the 4-question checklist in §5, don’t propose extraction; instead propose closing the gap (e.g., add a deprecation, ask for a downstream signal).

Next scheduled audit: v0.13.0 (every 3 minor releases; current is v0.10.0).

What’s in this doc:

If you only read one section, read §5.


1. Context

Two motivations drove this strategy doc:

Architectural debt. The README’s three-tier architecture (functional core / Protocol-based orchestration / reproducibility scaffolding) is partly aspirational. Three Tier-2 helpers (leakage, splits, loaders) import upward into harness, which means you can’t use a Splitter without taking the orchestrator along. Two modules (thresholds, operating_points) straddle the Tier 1 / Tier 2 boundary — they’re labeled as Protocols but their impls reach into the math kernels, so they behave like Tier 1 utilities. metrics has in-degree 4 (analysis, operating_points, thresholds, harness all import it), which makes “extract metrics” a refactor that ripples into half the package.

Conceptual sprawl. 24 modules, 12.3K LOC, six distinct concerns under one name: math kernels, eval orchestration, evidence/claims, reproducibility scaffolding, plotting, standalone utilities (text dedup, markdown rendering, paths, config). The “eval-toolkit” name fits the math + orchestration; it strains when applied to a 1300-LOC text-deduplication library or a markdown renderer.

The constraint: stay mono-repo by default. Multi-repo splits incur real coordination overhead (synchronized releases, version pin management, cross-repo PRs). Split only when a candidate proves independent users / release cadence / install-slimness payoff (the 4-question checklist in §5).

This is a roadmap document, not an implementation plan. No code moves in v0.10.x.

2. Audit findings

Snapshot from a dependency-graph audit run at v0.10.0 (f6ef004).

Coupling map

True leaves (zero in-package imports — trivially extractable if their consumers go too): _version, protocols, seeds, config, paths, evidence, artifacts, claims, docs, plotting, provenance, text_dedup, calibration, bootstrap, metrics. That’s 15 of 24 modules.

Hub. src/eval_toolkit/harness.py is the central orchestrator — in-degree 3 (splits, loaders, leakage import it), out-degree 5 (it imports from artifacts, bootstrap, metrics, operating_points, protocols, plus TYPE_CHECKING imports of leakage and splits). This is the package’s identity; extracting it would leave a confusingly-named shell.

High-leverage kernel. src/eval_toolkit/metrics.py has in-degree 4 (analysis, operating_points, thresholds, harness). Anything that touches metrics ripples broadly. This makes metrics foundational — and unextractable without taking its consumers along.
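The in-degree and out-degree figures above can be re-derived from an edge list. The sketch below is illustrative only: `EDGES` is transcribed by hand from the import relationships named in this doc (it omits modules whose imports are not described here, such as manifest), and the helper names are hypothetical, not part of eval_toolkit.

```python
from collections import Counter

# Runtime import edges (importer -> imported), transcribed from this doc.
# TYPE_CHECKING-only imports are deliberately left out.
EDGES = {
    "harness": {"artifacts", "bootstrap", "metrics", "operating_points", "protocols"},
    "splits": {"harness"},
    "loaders": {"harness", "provenance"},
    "leakage": {"harness"},
    "thresholds": {"metrics", "calibration"},
    "operating_points": {"metrics", "thresholds"},
    "analysis": {"metrics", "bootstrap"},
}


def in_degrees(edges):
    """Count how many modules import each module."""
    counts = Counter()
    for imported in edges.values():
        counts.update(imported)
    return counts


def out_degree(edges, module):
    """Number of in-package modules that `module` imports."""
    return len(edges.get(module, ()))


def leaves(edges):
    """Modules that appear in the graph but import nothing in-package."""
    mentioned = set(edges) | {m for deps in edges.values() for m in deps}
    return sorted(m for m in mentioned if not edges.get(m))


if __name__ == "__main__":
    degrees = in_degrees(EDGES)
    print("metrics in-degree:", degrees["metrics"])
    print("harness in-degree:", degrees["harness"])
    print("harness out-degree:", out_degree(EDGES, "harness"))
    print("leaves in this partial graph:", leaves(EDGES))
```

Because the edge list is partial, `leaves` returns only the leaf modules that some listed module imports, not the full set of 15.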

Tier 1/2 blurring

The README claims thresholds and operating_points are Tier 2 (Protocol-based orchestration). In practice:

  • src/eval_toolkit/thresholds.py defines the ThresholdSelector Protocol and 6 reference impls. The impls import metrics and calibration (both Tier 1). A selector uses a metric — that’s fine — but it means thresholds is half-Tier-1 in coupling terms.

  • src/eval_toolkit/operating_points.py (178 LOC) imports metrics + thresholds. It’s labeled Tier 2 but acts like a Tier 1 utility for fit-then-apply threshold workflows.

This blurring is annotated in the doc, but the fix (relabeling + moving operating_points under the math kernels) is deferred to the v0.11.0 reorg in §4.

Upward-import anti-pattern

Three Tier-2 helpers reach into the orchestrator for type shapes:

  • src/eval_toolkit/leakage.py imports harness (for EvalSlice).

  • src/eval_toolkit/splits.py imports harness (same).

  • src/eval_toolkit/loaders.py imports harness + provenance.

These helpers should depend on a type contract, not the orchestrator. The fix in v0.11.0 is to extract a small eval_toolkit/_types.py module with the EvalSlice and RunResult dataclasses, and have helpers + harness both depend on _types. Pure internal refactor; no public API change. Eliminates the cycle that blocks “I want a Splitter without the harness” use cases.
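A minimal single-file sketch of the dependency direction the `_types` extraction aims for. Only the names `EvalSlice`, `RunResult`, and `_types.py` come from this doc; every field, the `Splitter` method signature, and `HeadTailSplitter` are hypothetical placeholders chosen for illustration, not the real definitions.

```python
from dataclasses import dataclass, field
from typing import Protocol, Sequence


# --- would live in eval_toolkit/_types.py (all fields are hypothetical) ---
@dataclass(frozen=True)
class EvalSlice:
    name: str
    y_true: Sequence[int]
    y_score: Sequence[float]


@dataclass(frozen=True)
class RunResult:
    slice_name: str
    metrics: dict[str, float] = field(default_factory=dict)


# --- would live in splits.py: depends on _types only, never on harness ---
class Splitter(Protocol):
    def split(self, data: EvalSlice) -> tuple[EvalSlice, EvalSlice]: ...


class HeadTailSplitter:
    """Hypothetical splitter: the first `n_head` rows versus the rest."""

    def __init__(self, n_head: int) -> None:
        self.n_head = n_head

    def split(self, data: EvalSlice) -> tuple[EvalSlice, EvalSlice]:
        k = self.n_head
        head = EvalSlice(f"{data.name}/head", data.y_true[:k], data.y_score[:k])
        tail = EvalSlice(f"{data.name}/tail", data.y_true[k:], data.y_score[k:])
        return head, tail


if __name__ == "__main__":
    val = EvalSlice("val", [0, 1, 1, 0], [0.1, 0.9, 0.8, 0.3])
    head, tail = HeadTailSplitter(1).split(val)
    print(head.name, tail.name)
```

Nothing in the splitter half references the orchestrator, which is the property the refactor is meant to guarantee.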

3. LOC + concern map

| Concern bucket | Modules | LOC |
| --- | --- | --- |
| Math kernels (Tier 1) | metrics, bootstrap, calibration, thresholds, operating_points | 5389 |
| Eval orchestration (Tier 2) | harness, splits, loaders, leakage, analysis | 2698 |
| Evidence layer (v0.9) | claims, artifacts, evidence | 948 |
| Reproducibility (Tier 3) | manifest, provenance, seeds | 797 |
| Visualization | plotting | 956 |
| Standalone utilities | text_dedup, docs, paths, config, protocols, _version | 1973 |

Three modules — metrics (1550), bootstrap (1149), text_dedup (1327) — are 32% of the LOC.

5. Extraction checklist

For any module X proposed for extraction into a separate PyPI package, run all four questions. Extract only if ≥ 2 of 4 are yes. Default = stay together.

Each question has a concrete command so the audit is reproducible by a human or AI session.

Q1: Has X been touched in the last 2 minor releases for reasons unrelated to the rest of the package?

# Replace v0.<n-2> with the version 2 minors back from current.
# --name-only --full-diff lists every file each commit touched, not just <X>.py.
git log --oneline --name-only --full-diff v0.<n-2>..HEAD -- src/eval_toolkit/<X>.py

If every listed commit is X-only (no co-modifications of harness, metrics, etc.), mark yes.

Q2: Has at least one downstream consumer asked to use X without pulling the rest of eval-toolkit?

# Search known consumer repos
gh search code "eval_toolkit.<X>" --owner <user>
# Or grep manually if consumer repos are private

Known consumer repos: prompt-injection-detector, prompt-injection-showcase, prompt-injection-sdd. If any has filed an issue, asked in chat, or imported just <X> without the harness, mark yes.

Q3: Does X have heavy / churning external deps whose churn would otherwise force main-package bumps?

# Check release cadence of X's primary external dep
pip index versions <X-key-dep>

If the dep released ≥ 2 minor versions in the last 6 months and we’ve seen breakage in the main package because of it, mark yes.

Q4: Is X’s public API stable enough to commit to a separate release cadence?

git log --oneline -- src/eval_toolkit/<X>.py | head -20
grep -A 50 "^__all__" src/eval_toolkit/<X>.py

If __all__ hasn’t changed in the last 2 minor releases, mark yes. (A churning API can’t be split off because every change forces a new release of the spun-off package.)
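One way to make the Q4 comparison mechanical is to diff the `__all__` literal between two versions of the file. The sketch below assumes `__all__` is a plain list or tuple literal assigned at module level; the function names `read_all` and `all_diff` are hypothetical and not part of eval_toolkit. The old source text could be obtained with `git show <tag>:src/eval_toolkit/<X>.py`.

```python
import ast


def read_all(source: str) -> list[str]:
    """Return the names in a module-level `__all__` literal, or [] if absent."""
    for node in ast.parse(source).body:
        if isinstance(node, ast.Assign):
            targets = node.targets
        elif isinstance(node, ast.AnnAssign) and node.value is not None:
            targets = [node.target]
        else:
            continue
        if any(isinstance(t, ast.Name) and t.id == "__all__" for t in targets):
            # literal_eval accepts an AST node and only evaluates literals.
            return list(ast.literal_eval(node.value))
    return []


def all_diff(old_source: str, new_source: str) -> dict[str, list[str]]:
    """Names added to or removed from `__all__` between two source versions."""
    old, new = set(read_all(old_source)), set(read_all(new_source))
    return {"added": sorted(new - old), "removed": sorted(old - new)}


if __name__ == "__main__":
    old = '__all__ = ["auroc", "brier"]\n'
    new = '__all__ = ["auroc", "brier", "ece"]\n'
    print(all_diff(old, new))
```

An empty diff in both directions corresponds to a yes on Q4; a dynamically built `__all__` would need a different approach.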

Verdict

  • 0–1 yes — do not extract. Keep X in the main package.

  • 2 yes — extraction is justified. Open a planning round with this doc as the entry point.

  • 3–4 yes — extraction is overdue. Prioritize.

Document the verdict in this doc as an appendix, even if the verdict is “do not extract.” A “nothing changed” outcome is a valid audit result; recording it explicitly prevents future sessions from re-deriving the same conclusion.
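The verdict rule is simple enough to encode directly. This is a sketch for recording audit results consistently; the function name and the answer keys are hypothetical, and only the thresholds (0 to 1, 2, 3 to 4) come from this doc.

```python
QUESTIONS = ("Q1", "Q2", "Q3", "Q4")


def extraction_verdict(answers: dict[str, bool]) -> str:
    """Map the four yes/no checklist answers to the verdict defined above."""
    if set(answers) != set(QUESTIONS):
        raise ValueError(f"expected answers for exactly {QUESTIONS}")
    yes_count = sum(bool(answers[q]) for q in QUESTIONS)
    if yes_count <= 1:
        return "do not extract"
    if yes_count == 2:
        return "extraction justified"
    return "extraction overdue"


if __name__ == "__main__":
    audit = {"Q1": True, "Q2": False, "Q3": False, "Q4": True}
    print(extraction_verdict(audit))
```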

6. Audit cadence

Cadence: every 3 minor releases. Next audit due at v0.13.0 (current is v0.10.0).

How to run the full audit

A human or AI session can copy-paste these commands:

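# 1. Refresh the dependency graph.
#    The ^ anchor matches only unindented absolute imports, so imports nested
#    under `if TYPE_CHECKING:` are already skipped. Relative imports
#    (from .x import y) are not matched by this pattern.
grep -rnE "^from eval_toolkit\.|^import eval_toolkit\." src/eval_toolkit/ \
  | grep -v "TYPE_CHECKING" \
  | sort -u

# 2. Per-module LOC (re-confirms the LOC + concern map in §3)
shopt -s globstar  # bash only: makes ** recurse; zsh expands ** by default
wc -l src/eval_toolkit/**/*.py

# 3. Per-module __all__ size
#    Counts lines containing a quoted name, so it assumes one name per line.
for f in src/eval_toolkit/**/*.py; do
  printf "%-40s " "$f"
  grep -A 50 "^__all__" "$f" | grep -c '"[A-Za-z_]'
done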

# 4. Coverage per module (re-confirms the v0.10.0 ≥90% per-module floor)
uv run pytest -q --cov=eval_toolkit --cov-report=term --no-header

Then for each candidate module, run the §5 checklist commands.

Audit owner

Whoever opens the first PR on or after the v0.13.0 milestone. The output goes into:

  • A new appendix to this doc (§Audit log — added in the same PR that triggers the audit), OR

  • A comment on a tracking issue, with a link from this doc.

Either way, the audit produces a written artifact. If no extraction passes the §5 checklist, write “audit ran at v0.13.0, nothing changed” and move on. “Nothing changed” is a valid result.

7. Anti-goals

Things this strategy explicitly rules out:

  • Don’t split into 6 separate repos at once. Multi-repo coordination overhead (synchronized releases, version pin management, cross-repo PRs) is real; the value has to be earned per-extraction via the §5 checklist.

  • Don’t propose alternative bucket groupings. The 6-bucket cut in §4 is the agreed shape. Re-litigating it costs more than it buys. If a new module genuinely doesn’t fit any bucket, that’s a signal to write up a §8 open question, not to redraw the map.

  • Don’t extract metrics, bootstrap, or calibration as separate packages. Foundational math kernels with multiple in-package consumers; circular installs (eval-toolkit-core depending on eval-toolkit-metrics etc.) are worse than co-location.

  • Don’t add a plugin / entry-point system. The Protocol layer in eval_toolkit/utils/protocols.py already gives consumers Tier 2 extension; entry-points are a lot of machinery for the same outcome.

  • Don’t rename the package. eval-toolkit stays as the PyPI name. Sub-packages get sensible names within (eval_toolkit.viz, eval_toolkit.core, etc.).

  • Don’t pre-emptively add deprecation warnings to today’s API. No DeprecationWarning lands until a Phase 2 extraction actually triggers and the back-compat shim is the only thing left in the main package.

Parallelism (v0.34.0+)

The toolkit was historically single-threaded — not by policy, just because no primitive needed parallelism. v0.34.0 codifies the explicit pattern: opt-in per-function n_jobs parameter backed by the internal _parallel.parallel_map helper (joblib loky backend; reproducibility-by-default via np.random.SeedSequence).

What this means for contributors:

  • Adding n_jobs to a new function is allowed and encouraged when the function has a Python-level loop over independent work units with medium+ per-item cost. Follow the checklist in methodology/parallelism.md §“When to add n_jobs”.

  • Use the helper, don’t roll your own. All parallelism flows through _parallel.parallel_map; no inline joblib.Parallel, no concurrent.futures, no raw multiprocessing, no asyncio.

  • Default sequential. n_jobs: int = 1 keeps existing call sites unchanged and preserves reproducibility/traceback fidelity by default.
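To illustrate the pattern in the bullets above, here is a rough stand-in for a seeded, opt-in parallel map. It is not the real `_parallel.parallel_map`, whose signature this doc does not show: `parallel_map_sketch`, `_call`, and `resample_mean` are hypothetical names, and the per-item seeding via `np.random.SeedSequence.spawn` is one plausible reading of "reproducibility-by-default". It assumes NumPy is installed, and joblib only when `n_jobs != 1`.

```python
import numpy as np


def _call(fn, item, child_seed):
    """Build the per-item generator inside the worker, then run fn."""
    return fn(item, np.random.default_rng(child_seed))


def parallel_map_sketch(fn, items, *, seed, n_jobs=1):
    """Apply fn(item, rng) to each item with an independent, reproducible RNG.

    Each item gets its own child seed, so results do not depend on which
    worker runs which item or in what order.
    """
    children = np.random.SeedSequence(seed).spawn(len(items))
    if n_jobs == 1:
        # Default path: plain loop, ordinary tracebacks, no extra dependency.
        return [_call(fn, item, child) for item, child in zip(items, children)]
    from joblib import Parallel, delayed  # imported lazily; loky is the default backend

    return Parallel(n_jobs=n_jobs)(
        delayed(_call)(fn, item, child) for item, child in zip(items, children)
    )


def resample_mean(values, rng):
    """One bootstrap replicate: mean of a resample drawn with replacement."""
    sample = rng.choice(values, size=len(values), replace=True)
    return float(sample.mean())


if __name__ == "__main__":
    groups = [np.arange(10.0), np.arange(20.0), np.arange(30.0)]
    print(parallel_map_sketch(resample_mean, groups, seed=0))
```

With this seeding scheme, calling the function twice with the same seed should give identical results whatever the value of n_jobs, since each item's random stream is fixed before any work is dispatched.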

8. Open questions

Deferred to the v0.13.0 audit (or earlier if a trigger fires):

  • Should analysis.py (CSV/JSONL prediction readers) move to evidence/? It’s used by the evidence-core paired-diff workflow more than by the eval orchestrator. Currently lives in harness/ per §4 because it imports bootstrap + metrics, but the consumer shape suggests it belongs with claims + artifacts.

  • Should protocols.py become its own published package (eval-toolkit-protocols) so consumer libraries can declare a dependency on the contract without depending on any implementation? Useful only if there’s ever a second harness implementation. Today there isn’t.

  • v1.0 timing — when does the API freeze trigger? The §5 checklist’s Q4 (__all__ stability) is the relevant signal.