Repo Strategy#
How eval-toolkit is organized today, the target shape, and the
rules for when to extract a sub-package into its own repo.
If you’re an AI session entering this repo to propose reorganization, extraction, or a new module: read this entire doc before proposing changes.
The current target shape is locked-in for v0.10.x: in-place mono-repo with the 6 sub-package layout in §4. Don’t propose alternative groupings. Don’t propose multi-repo splits unless an extraction passes the §5 checklist.
Before any extraction proposal, run the audit in §6. If the candidate fails the 4-question checklist (fewer than 2 yes answers), don’t propose extraction; instead propose closing the gap (e.g., add a deprecation, ask for a downstream signal).
Next scheduled audit: v0.13.0 (every 3 minor releases; current is v0.10.0).
What’s in this doc:
§1 Context — why this exists.
§2 Audit findings — the dependency-graph state as of v0.10.0.
§3 LOC + concern map — vocabulary the rest of the doc uses.
§4 Recommended target shape — the 6-bucket layout.
§5 Extraction checklist — the 4 questions that decide whether to spin off a sub-package.
§6 Audit cadence — how to re-run this audit.
§7 Anti-goals — paths explicitly ruled out.
§8 Open questions — deferred to a future round.
§9 Cross-links — related docs.
If you only read one section, read §5.
1. Context#
Two motivations drove this strategy doc:
Architectural debt. The README’s three-tier architecture
(functional core / Protocol-based orchestration / reproducibility
scaffolding) is partly aspirational. Three Tier-2 helpers
(leakage, splits, loaders) import upward into harness,
which means you can’t use a Splitter without taking the orchestrator
along. Two modules (thresholds, operating_points) straddle the
Tier 1 / Tier 2 boundary — they’re labeled as Protocols but their
impls reach into the math kernels, so they behave like Tier 1
utilities. metrics has in-degree 4 (analysis,
operating_points, thresholds, harness all import it), which makes
“extract metrics” a refactor that ripples into half the package.
Conceptual sprawl. 24 modules, 12.3K LOC, six distinct concerns under one name: math kernels, eval orchestration, evidence/claims, reproducibility scaffolding, plotting, standalone utilities (text dedup, markdown rendering, paths, config). The “eval-toolkit” name fits the math + orchestration; it strains when applied to a 1300-LOC text-deduplication library or a markdown renderer.
The constraint: stay mono-repo by default. Multi-repo splits incur real coordination overhead (synchronized releases, version pin management, cross-repo PRs). Split only when a candidate proves independent users / release cadence / install-slimness payoff (the 4-question checklist in §5).
This is a roadmap document, not an implementation plan. No code moves in v0.10.x.
2. Audit findings#
Snapshot from a dependency-graph audit run at v0.10.0
(f6ef004).
Coupling map#
True leaves (zero in-package imports — trivially extractable
if their consumers go too): _version, protocols, seeds,
config, paths, evidence, artifacts, claims, docs,
plotting, provenance, text_dedup, calibration, bootstrap,
metrics. That’s 15 of 24 modules.
Hub. src/eval_toolkit/harness.py is the central orchestrator —
in-degree 3 (splits, loaders, leakage import it), out-degree 5
(it imports from artifacts, bootstrap, metrics,
operating_points, protocols, plus TYPE_CHECKING imports of
leakage and splits). This is the package’s identity; extracting
it would leave a confusingly-named shell.
High-leverage kernel. src/eval_toolkit/metrics.py has
in-degree 4 (analysis, operating_points, thresholds,
harness). Anything that touches metrics ripples broadly. This
makes metrics foundational — and unextractable without taking its
consumers along.
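The in-degree and out-degree figures above can be re-derived from a plain edge list. The following is a minimal sketch, assuming the import edges quoted in this doc (runtime imports only, TYPE_CHECKING imports excluded); the names `EDGES` and `degrees` are illustrative and not part of the package.

```python
from collections import defaultdict

# Import edges (importer -> imported) as quoted in this doc.
# TYPE_CHECKING-only imports are left out, as in the audit counts.
EDGES = {
    "harness": ["artifacts", "bootstrap", "metrics", "operating_points", "protocols"],
    "splits": ["harness"],
    "loaders": ["harness", "provenance"],
    "leakage": ["harness"],
    "analysis": ["bootstrap", "metrics"],
    "operating_points": ["metrics"],
    "thresholds": ["metrics", "calibration"],
}


def degrees(edges):
    """Return (in_degree, out_degree) dicts for an importer -> imported mapping."""
    in_deg = defaultdict(int)
    out_deg = defaultdict(int)
    for importer, imported in edges.items():
        unique_targets = set(imported)
        out_deg[importer] = len(unique_targets)
        for target in unique_targets:
            in_deg[target] += 1
    return dict(in_deg), dict(out_deg)


IN_DEGREE, OUT_DEGREE = degrees(EDGES)
```

With these edges, `IN_DEGREE["metrics"]` is 4, `IN_DEGREE["harness"]` is 3, and `OUT_DEGREE["harness"]` is 5, matching the figures in this section.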
Tier 1/2 blurring#
The README claims thresholds and operating_points are Tier 2
(Protocol-based orchestration). In practice:
src/eval_toolkit/thresholds.py defines the ThresholdSelector Protocol and 6 reference impls. The impls import metrics and calibration (both Tier 1). A selector uses a metric, which is fine, but it means thresholds is half-Tier-1 in coupling terms.
src/eval_toolkit/operating_points.py (178 LOC) imports metrics and thresholds. It’s labeled Tier 2 but acts like a Tier 1 utility for fit-then-apply threshold workflows.
This blurring is annotated in the doc, but the fix (relabeling +
moving operating_points under the math kernels) is deferred to the
v0.11.0 reorg in §4.
Upward-import anti-pattern#
Three Tier-2 helpers reach into the orchestrator for type shapes:
src/eval_toolkit/leakage.py imports harness (for EvalSlice).
src/eval_toolkit/splits.py imports harness (same).
src/eval_toolkit/loaders.py imports harness + provenance.
These helpers should depend on a type contract, not the
orchestrator. The fix in v0.11.0 is to extract a small
eval_toolkit/_types.py module with the EvalSlice and RunResult
dataclasses, and have helpers + harness both depend on _types.
Pure internal refactor; no public API change. Eliminates the cycle
that blocks “I want a Splitter without the harness” use cases.
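The shape of that fix can be sketched as follows. This is a hedged illustration only: the field names on `EvalSlice` and `RunResult` and the helper `k_fold_slices` are assumptions made for the example, not the real definitions. The point is that the helper needs the type contract and nothing from the orchestrator.

```python
from dataclasses import dataclass, field
from typing import Iterator, Sequence


# Hypothetical contents of eval_toolkit/_types.py. Field names are
# illustrative assumptions, not the real dataclass definitions.
@dataclass(frozen=True)
class EvalSlice:
    name: str
    indices: Sequence[int]


@dataclass(frozen=True)
class RunResult:
    slice_name: str
    metrics: dict[str, float] = field(default_factory=dict)


# A helper in the style of splits.py: it only needs the EvalSlice shape,
# so it can be used without importing the orchestrator.
def k_fold_slices(n_rows: int, k: int) -> Iterator[EvalSlice]:
    """Yield k contiguous, near-equal folds over range(n_rows)."""
    base, extra = divmod(n_rows, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        yield EvalSlice(name=f"fold-{fold}", indices=range(start, start + size))
        start += size
```

In the real layout the two dataclasses would sit in `_types.py` and the helper would import them from there; they share one block here only so the sketch runs on its own.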
3. LOC + concern map#
| Concern bucket | Modules | LOC |
|---|---|---|
| Math kernels (Tier 1) | metrics, bootstrap, calibration, thresholds, operating_points | 5389 |
| Eval orchestration (Tier 2) | harness, splits, loaders, leakage, analysis | 2698 |
| Evidence layer (v0.9) | claims, artifacts, evidence | 948 |
| Reproducibility (Tier 3) | manifest, provenance, seeds | 797 |
| Visualization | plotting | 956 |
| Standalone utilities | text_dedup, docs, paths, config, protocols, _version | 1973 |
Three modules — metrics (1550), bootstrap (1149), text_dedup
(1327) — are 32% of the LOC.
4. Recommended target shape#
Phase 1 — in-place mono-repo reorganization (target: v0.11.0)#
Restructure src/eval_toolkit/ into 6 sub-packages, one per bucket:
src/eval_toolkit/
├── __init__.py # lazy re-exports preserved (back-compat)
├── _version.py
├── _types.py # NEW: EvalSlice, RunResult (extracted from harness)
├── core/ # math kernels — Tier 1
│ ├── metrics.py
│ ├── bootstrap.py
│ ├── calibration.py
│ ├── thresholds.py
│ └── operating_points.py
├── harness/ # eval orchestration — Tier 2
│ ├── __init__.py # re-exports evaluate, evaluate_folded, EvalSlice
│ ├── orchestrator.py # was harness.py
│ ├── splits.py
│ ├── loaders.py
│ ├── leakage.py
│ └── analysis.py
├── evidence/ # claims / artifacts (v0.9)
│ ├── claims.py
│ ├── artifacts.py
│ └── evidence.py
├── scaffolding/ # reproducibility — Tier 3
│ ├── manifest.py
│ ├── provenance.py
│ └── seeds.py
├── viz/ # plotting (matplotlib-only)
│ └── plotting.py
└── utils/ # standalone tools
    ├── text_dedup.py
    ├── docs.py # markdown rendering
    ├── paths.py
    ├── config.py
    └── protocols.py # the trait crate
Top-level eval_toolkit/__init__.py keeps every existing lazy
export working — consumers see no breaking change. Internally:
Helpers in harness/ depend on _types instead of importing harness/orchestrator. Cycle eliminated.
A CI lint test (tests/test_import_directions.py) parses src/eval_toolkit/**/*.py ASTs and fails the build if a sub-package imports from a “higher” sub-package per a declared import order: _types < core < (harness, evidence, scaffolding, viz, utils).
The protocols module moves to utils/. It’s a trait crate with zero runtime cost; placing it under utils makes the conceptual map cleaner (it’s a contract used by orchestration but not part of it).
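The import-direction lint could look roughly like the sketch below. It is not the contents of tests/test_import_directions.py: the names `RANK` and `violations` are hypothetical, and treating the five peer sub-packages as equal rank is an assumption, since the declared order does not say how peers relate to each other. The sketch checks absolute imports only; relative imports and TYPE_CHECKING blocks would need extra handling.

```python
import ast

# Declared order from this section: _types < core < everything else.
# Giving the five peer sub-packages the same rank is an assumption.
RANK = {"_types": 0, "core": 1, "harness": 2, "evidence": 2,
        "scaffolding": 2, "viz": 2, "utils": 2}


def violations(source: str, importer: str) -> list[str]:
    """Return absolute eval_toolkit imports in `source` that target a
    sub-package ranked higher than `importer`."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            if node.module == "eval_toolkit":
                # `from eval_toolkit import viz` names the sub-package in the alias
                names = [f"eval_toolkit.{alias.name}" for alias in node.names]
            else:
                names = [node.module]
        elif isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        else:
            continue
        for name in names:
            parts = name.split(".")
            if len(parts) >= 2 and parts[0] == "eval_toolkit":
                # Unknown targets (e.g. _version) get rank -1 and never violate
                if RANK.get(parts[1], -1) > RANK[importer]:
                    found.append(name)
    return found
```

A test would call `violations` on each file's text with the sub-package that file lives in, and fail if any list is non-empty.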
Phase 2 — earn-then-extract (long-term, no committed schedule)#
Three candidates, ranked by independence + value:
eval-toolkit-text-dedup: separate PyPI package + repo. 1327 LOC, sklearn-heavy, zero in-package imports, conceptually orthogonal (it’s a similarity/dedup library, not an eval kernel). The current eval_toolkit.utils.text_dedup becomes a thin re-export from the new package + a DeprecationWarning. Trigger: a downstream consumer asks for it standalone, or the sklearn-version pin starts costing the main package.
eval-toolkit-plotting: separate PyPI package + repo. 956 LOC, matplotlib-only (already gated behind the [plotting] extra), zero in-package imports, conceptually orthogonal. The split lets matplotlib churn (3.x bumps, deprecations) without touching the math core. Trigger: a matplotlib release breaks something and we want to ship a viz-only fix without bumping the main package.
eval-toolkit-evidence: separate package, only if v0.10+ evidence-layer adoption proves it has independent consumers. 948 LOC, recently added in v0.9, churning. The gate framework (claims.py) is generic enough to be useful outside eval-toolkit. But: deeply integrated with the RunResult shape, so extraction means freezing that contract first. Trigger: someone wants to use the gate framework against a non-eval-toolkit result payload.
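The “thin re-export + DeprecationWarning” shim mentioned for text_dedup can be sketched with a module-level `__getattr__` (PEP 562). Everything named here is an assumption for illustration: the import name `eval_toolkit_text_dedup`, the function `dedup`, and the in-memory modules that stand in for real files so the sketch runs without either package installed.

```python
import sys
import types
import warnings

# Stand-in for the extracted distribution. The module name
# `eval_toolkit_text_dedup` and the function `dedup` are hypothetical.
new_pkg = types.ModuleType("eval_toolkit_text_dedup")
new_pkg.dedup = lambda texts: list(dict.fromkeys(texts))
sys.modules["eval_toolkit_text_dedup"] = new_pkg


def _shim_getattr(name: str):
    """Forward attribute access to the new package and warn the caller."""
    import eval_toolkit_text_dedup as _new

    warnings.warn(
        f"eval_toolkit.utils.text_dedup.{name} has moved to eval_toolkit_text_dedup",
        DeprecationWarning,
        stacklevel=2,
    )
    return getattr(_new, name)


# Build the shim module in memory. In a real file this would simply be
# `def __getattr__(name): ...` at the top level of text_dedup.py.
shim = types.ModuleType("text_dedup_shim")
shim.__getattr__ = _shim_getattr

# Usage: old-style access still works, and emits one DeprecationWarning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = shim.dedup(["a", "b", "a"])
```

Because the warning fires on attribute access rather than at import time, consumers who never touch the old path see nothing.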
What we’re NOT extracting#
metrics, bootstrap, calibration: too foundational; in-degree 4 on metrics means any extraction would have to take its consumers along. These stay the math core of eval-toolkit.
harness + the orchestration ring: this is the package’s identity. Extracting it would leave a confusingly-named shell.
manifest + provenance + seeds: small, tight Tier-3 trio, no consumer demand to use them outside an eval context.
5. Extraction checklist#
For any module X proposed for extraction into a separate PyPI
package, run all four questions. Extract only if ≥ 2 of 4 are
yes. Default = stay together.
Each question has a concrete command so the audit is reproducible by a human or AI session.
Q2 — Has at least one downstream consumer asked to use X without pulling the rest of eval-toolkit?#
# Search known consumer repos
gh search code "eval_toolkit.<X>" --owner <user>
# Or grep manually if consumer repos are private
Known consumer repos: prompt-injection-detector,
prompt-injection-showcase, prompt-injection-sdd. If any has
filed an issue, asked in chat, or imported just <X> without the
harness, mark yes.
Q3 — Does X have heavy / churning external deps whose churn would otherwise force main-package bumps?#
# Check release cadence of X's primary external dep
pip index versions <X-key-dep>
If the dep released ≥ 2 minor versions in the last 6 months and we’ve seen breakage in the main package because of it, mark yes.
Q4 — Is X’s public API stable enough to commit to a separate release cadence?#
git log --oneline -- src/eval_toolkit/<X>.py | head -20
grep -A 50 "^__all__" src/eval_toolkit/<X>.py
If __all__ hasn’t changed in the last 2 minor releases, mark
yes. (A churning API can’t be split off because every change
forces a new release of the spun-off package.)
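The Q4 comparison can also be done mechanically rather than by eye. The sketch below assumes `__all__` is a plain list or tuple literal assigned at module level (an annotated assignment would need `ast.AnnAssign` handling); `read_all` and `api_diff` are hypothetical helper names. The two source strings would come from something like `git show <tag>:src/eval_toolkit/<X>.py` at the two release tags being compared.

```python
import ast


def read_all(source: str) -> set[str]:
    """Extract the names in a module-level `__all__ = [...]` literal."""
    for node in ast.parse(source).body:
        if isinstance(node, ast.Assign):
            targets = [t.id for t in node.targets if isinstance(t, ast.Name)]
            if "__all__" in targets:
                return set(ast.literal_eval(node.value))
    return set()  # no literal __all__ found


def api_diff(old_source: str, new_source: str) -> dict[str, set[str]]:
    """Names added to and removed from __all__ between two versions."""
    old, new = read_all(old_source), read_all(new_source)
    return {"added": new - old, "removed": old - new}
```

An empty diff across the last two minor releases corresponds to a yes on Q4.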
Verdict#
0–1 yes — do not extract. Keep X in the main package.
2 yes — extraction is justified. Open a planning round with this doc as the entry point.
3–4 yes — extraction is overdue. Prioritize.
Document the verdict in this doc as an appendix, even if the verdict is “do not extract.” “No change” is a valid audit result, but it still gets an explicit “ran the audit, nothing changed” entry so that future sessions don’t re-derive the same conclusion.
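For audits that are scripted, the verdict tiers reduce to a small function. This is a sketch only; `extraction_verdict` and the shape of its input are assumptions, not an existing helper.

```python
def extraction_verdict(answers: dict[str, bool]) -> str:
    """Map the four yes/no checklist answers to the verdict tiers above."""
    if len(answers) != 4:
        raise ValueError("expected exactly four answers (Q1-Q4)")
    yes = sum(answers.values())
    if yes <= 1:
        return "do not extract"
    if yes == 2:
        return "extraction justified"
    return "extraction overdue"
```

For example, `extraction_verdict({"Q1": True, "Q2": False, "Q3": True, "Q4": False})` returns `"extraction justified"`.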
6. Audit cadence#
Cadence: every 3 minor releases. Next audit due at v0.13.0 (current is v0.10.0).
How to run the full audit#
A human or AI session can copy-paste these commands:
# 0. bash only: enable recursive ** globs (zsh expands them by default)
shopt -s globstar
# 1. Refresh the dependency graph
grep -rnE "^from eval_toolkit\.|^import eval_toolkit\." src/eval_toolkit/ \
| grep -v "TYPE_CHECKING" \
| sort | uniq
# 2. Per-module LOC (re-confirms the LOC + concern map in §3)
wc -l src/eval_toolkit/**/*.py
# 3. Per-module __all__ size
for f in src/eval_toolkit/**/*.py; do
printf "%-40s " "$f"
grep -A 50 "^__all__" "$f" | grep -c '"[A-Za-z_]'
done
# 4. Coverage per module (re-confirms the v0.10.0 ≥90% per-module floor)
uv run pytest -q --cov=eval_toolkit --cov-report=term --no-header
Then for each candidate module, run the §5 checklist commands.
Audit owner#
Whoever opens a PR on or after the v0.13.0 milestone. The output goes into:
A new appendix to this doc (§Audit log, added in the same PR that triggers the audit), OR
A comment on a tracking issue, with a link from this doc.
Either way, the audit produces a written artifact. If no extraction passes the §5 checklist, write “audit ran at v0.13.0, nothing changed” and move on. Silence is a valid result.
7. Anti-goals#
Things this strategy explicitly rules out:
Don’t split into 6 separate repos at once. Multi-repo coordination overhead (synchronized releases, version pin management, cross-repo PRs) is real; the value has to be earned per-extraction via the §5 checklist.
Don’t propose alternative bucket groupings. The 6-bucket cut in §4 is the agreed shape. Re-litigating it costs more than it buys. If a new module genuinely doesn’t fit any bucket, that’s a signal to write up a §8 open question, not to redraw the map.
Don’t extract metrics, bootstrap, or calibration as separate packages. Foundational math kernels with multiple in-package consumers; circular installs (eval-toolkit-core depending on eval-toolkit-metrics etc.) are worse than co-location.
Don’t add a plugin / entry-point system. The Protocol layer in eval_toolkit/utils/protocols.py already gives consumers Tier 2 extension; entry-points are a lot of machinery for the same outcome.
Don’t rename the package. eval-toolkit stays as the PyPI name. Sub-packages get sensible names within (eval_toolkit.viz, eval_toolkit.core, etc.).
Don’t pre-emptively add deprecation warnings to today’s API. No DeprecationWarning lands until a Phase 2 extraction actually triggers and the back-compat shim is the only thing left in the main package.
Parallelism (v0.34.0+)#
The toolkit was historically single-threaded — not by policy, just because
no primitive needed parallelism. v0.34.0 codifies the explicit pattern:
opt-in per-function n_jobs parameter backed by the internal
_parallel.parallel_map helper (joblib
loky backend; reproducibility-by-default via np.random.SeedSequence).
What this means for contributors:
Adding n_jobs to a new function is allowed and encouraged when the function has a Python-level loop over independent work units with medium+ per-item cost. Follow the checklist in methodology/parallelism.md §“When to add n_jobs”.
Use the helper, don’t roll your own. All parallelism flows through _parallel.parallel_map; no inline joblib.Parallel, no concurrent.futures, no raw multiprocessing, no asyncio.
Default sequential. n_jobs: int = 1 keeps existing call sites unchanged and preserves reproducibility/traceback fidelity by default.
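The pattern described above (opt-in `n_jobs`, joblib's loky backend, one `np.random.SeedSequence` child per work unit) can be sketched as follows. The signature of `parallel_map` here is an assumption and may differ from the real `_parallel.parallel_map`; `noisy_mean` is a toy work unit invented for the example. joblib is imported lazily so the default sequential path only needs NumPy.

```python
from typing import Callable, Sequence, TypeVar

import numpy as np

T = TypeVar("T")
R = TypeVar("R")


def parallel_map(
    fn: Callable[[T, np.random.Generator], R],
    items: Sequence[T],
    *,
    seed: int,
    n_jobs: int = 1,
) -> list[R]:
    """Apply fn(item, rng) to each item with one independent RNG stream per item."""
    # One child SeedSequence per work unit, so results do not depend on
    # how items are distributed across workers.
    children = np.random.SeedSequence(seed).spawn(len(items))
    if n_jobs == 1:
        return [fn(item, np.random.default_rng(child))
                for item, child in zip(items, children)]
    from joblib import Parallel, delayed  # only needed on the parallel path

    return Parallel(n_jobs=n_jobs, backend="loky")(
        delayed(fn)(item, np.random.default_rng(child))
        for item, child in zip(items, children)
    )


def noisy_mean(values: list[float], rng: np.random.Generator) -> float:
    """Toy work unit: mean of one bootstrap resample."""
    sample = rng.choice(values, size=len(values), replace=True)
    return float(sample.mean())


data = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
first = parallel_map(noisy_mean, data, seed=0)
second = parallel_map(noisy_mean, data, seed=0)
```

Because each item's random stream is derived from the seed and the item's position, `first` and `second` are identical, and passing `n_jobs=2` is intended to give the same values as the sequential run.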
8. Open questions#
Deferred to the v0.13.0 audit (or earlier if a trigger fires):
Should analysis.py (CSV/JSONL prediction readers) move to evidence/? It’s used by the evidence-core paired-diff workflow more than by the eval orchestrator. Currently lives in harness/ per §4 because it imports bootstrap + metrics, but the consumer shape suggests it belongs with claims + artifacts.
Should protocols.py become its own published package (eval-toolkit-protocols) so consumer libraries can declare a dependency on the contract without depending on any implementation? Useful only if there’s ever a second harness implementation. Today there isn’t.
v1.0 timing: when does the API freeze trigger? The §5 checklist’s Q4 (__all__ stability) is the relevant signal.
9. Cross-links#
roadmap.md: broader feature roadmap. This doc is the organizational roadmap; that’s the feature roadmap.
methodology/versioning.md: the schema-evolution policy. Any extracted package’s API contract should follow the same additive-fields, additionalProperties: true discipline documented there.
MIGRATION.md: index of per-version migration guides. The v0.11.0 reorg gets its own entry there; future Phase-2 extractions get their own entries too.
extending.md: the Protocol-by-Protocol guide for custom Scorers, Splitters, gates. Stays the recommended extension point regardless of how the package is internally organized.