# ADR 0001: Flat single-file modules through v1.x **Status:** Accepted **Date:** 2026-05-21 (drafted at v0.46-prep; finalized at v0.48) **Deciders:** Brandon Behring (author), Round 5 / Round 6 / Round 7 audits (Codex + Gemini) **Supersedes:** N/A. **Superseded by:** N/A. ## Context eval-toolkit organizes its public API as **flat single-file modules** under `src/eval_toolkit/`. Every concern — metrics, bootstrap, calibration, harness, leakage, splits, thresholds, stacking, scorecard, etc. — lives in exactly one `.py` file, often several hundred to a couple thousand lines. The v1.0 plan ([[evaluate-all-the-work-twinkly-kite]]) inherits this layout from v0.x. Two pressure tests against the flat-module convention surfaced during planning + Round 5 audit: 1. **Module size.** A ground-truth check at the planning round measured 9 modules exceeding 800 LOC; at v0.48 the count is similar (`metrics.py` 1863, `bootstrap.py` 1849, `calibration.py` 1515, `harness.py` 1449, `text_dedup.py` 1403, `plotting.py` 1351, `leakage.py` 1215, `loaders.py` 1080, `thresholds.py` 953). The earlier "800 LOC trigger" criterion was already violated by current code at the planning round and remains so at v0.48 — nine production modules sit comfortably above the line without becoming discoverability problems. 2. **Functional grouping.** v0.46 introduced `_scorecard.py` and `metric_specs.py`; v0.47 added a top-level `_sweep.py` + `TextTransform` Protocol that unifies attacks and preprocessing under one strategy surface (`adversarial.py` ships 12 character-injection techniques via `ALL_TECHNIQUES`). The question naturally arises: should "metrics" become a subpackage with `metrics/scorecard.py`, `metrics/specs.py`, `metrics/calibration.py`, etc.? Same for `attacks/` (adversarial + preprocessing) or `bootstrap/`. Either move (size-driven or grouping-driven) would be **breaking** at v1.0 because consumer imports follow the flat module structure (`from eval_toolkit.metrics import pr_auc`, not `from eval_toolkit.metrics.calibration_metrics import pr_auc`). ## Decision **Stay flat through v1.x.** v1.0 commits to the existing flat-module layout; subpackage restructuring is deferred until at least one of three concrete triggers (below) fires, at which point v2.0 is the appropriate boundary. Specifically: 1. The current `src/eval_toolkit/*.py` file structure is the v1.0 contract for *internal-API* module paths. `eval_toolkit.metrics`, `eval_toolkit.bootstrap`, `eval_toolkit.calibration`, etc., remain valid submodule paths through the v1.x release line. 2. The **strict** v1.0 contract is still scoped to top-level `__all__` + Tier-2 Protocols (per [ADR 0003](0003-stability-contract-and-gate3-methodology.md)). Submodule public symbols carry an *additive-only* commitment — they may gain functionality in v1.x minors but the existing path remains importable. 3. Internal helpers (anything not in any module's `__all__`) are free to move in v1.x minors. ## Trigger criteria for a v2.0 subpackage restructure The flat-module convention holds until ANY of the following becomes true (none of which currently apply): ### A. Second production consumer with materially different surface needs The single-consumer / breaking-OK rationale that made the v0.46 → v1.0 breaking sequence cheap stops holding once a second downstream consumer adopts eval-toolkit with materially different access patterns. If the second consumer would benefit from a subpackage layout (e.g., importing from `eval_toolkit.attacks` rather than `eval_toolkit.adversarial` + `eval_toolkit.preprocessing` separately), the v2.0 prep cycle is the appropriate time to evaluate. ### B. Clear functional grouping that the codebase asks for Examples that could surface during v1.x development: - **`attacks/`** — adversarial + preprocessing (defense) modules collapse into a single attack-surface namespace as `sweep()` + `TextTransform` become the dominant interfaces. - **`calibration/`** — the 4-binary-adapter family (`fit_temperature_binary`, `fit_isotonic_binary`, `fit_platt_binary`, `fit_beta_binary`), the underlying calibrator fitters, and the calibration-error metrics (ECE family) split across `metrics.py` + `calibration.py` could naturally group. - **`bootstrap/`** — `bootstrap_ci`, `paired_bootstrap_*`, `block_bootstrap_on_folds`, `DeLongResult`, `BootstrapCI`, `PairedBootstrapCI`, MDE estimation, and CV-CLT could group around the inference primitive theme. None of these have been demanded by the consumer or by audit feedback through v0.47.0 (Round 5 + Round 6 + Round 7 audits did not flag layout as a concern). The trigger is "asks for it," not "could conceivably be grouped." ### C. Discoverability complaint from real users If multiple users file issues like "I can't find where X is" or "the module name doesn't match what it does," subpackage navigation may help. The trigger is **two or more such issues** from independent reporters — not internal tooling concerns or single-author preference. ### Not a trigger: per-module line count The original v1.0 plan draft cited a "module crosses ~800 LOC" trigger. Verification at planning time found 9 modules already exceed that — the trigger criterion was already violated, but nobody (consumer or audit) flagged module size as a problem. The pattern that emerged: eval-toolkit's large modules are **cohesive**. `metrics.py` is large because it contains every binary-classification metric primitive; that's the right home for them. `bootstrap.py` is large because it contains every bootstrap-CI primitive; same story. `text_dedup.py` is large because it contains the full similarity-strategy + leakage-check decomposition for one logical problem. Splitting a cohesive module on size alone produces arbitrary boundaries that don't help discoverability. v1.0 explicitly rejects size as a splitting trigger. ## Consequences **Positive:** - One module path per concern. `from eval_toolkit.metrics import pr_auc` is the canonical scalar-metric import; `from eval_toolkit.bootstrap import bootstrap_ci` is the canonical inference import. Consumers don't need to learn a deeper module taxonomy. - v1.0 → v1.x can grow without coordinating subpackage moves. - Test files mirror the module structure 1:1 (`tests/test_metrics.py`, etc.), which is easy to reason about. **Negative:** - `metrics.py` at ~1850 lines is genuinely cumbersome to navigate in an IDE. The mitigation is `__all__` (already in place) + the `metric_specs` namespace (v0.46+) which lets consumers reach individual metrics through a smaller surface without the IDE-navigation cost. - A future second consumer may want a different organizational vocabulary; v2.0 is the appropriate boundary. ## Alternatives considered ### Subpackage restructure at v1.0 Split the largest modules into subpackages (`metrics/`, `bootstrap/`, `calibration/`, `harness/`, `text_dedup/`). Rejected because: - v1.0 commits to API stability; a flat→subpackage shift is **exactly** the kind of breaking change v1.0 should NOT introduce (a v0.46-style soft-deprecation shim would need to re-export from old paths for multiple releases). - The single consumer reports no discoverability problems with the flat layout. - Each subpackage would require its own `__init__.py` re-export choices, multiplying the public-API surface to maintain. ### Defer the layout decision past v1.0 without ADR Make no commitment in either direction; revisit ad-hoc. Rejected because v1.0 IS the API stability commitment — the subpackage question must be explicitly resolved one way or the other. This ADR is the resolution. ### Hybrid: subpackage some, keep others flat E.g., split `metrics/` but keep `bootstrap.py` flat. Rejected because the hybrid pattern is harder to teach and creates inconsistent expectations: new users would have to look up every module to know whether it's a single file or a directory. ## Trigger to revisit This ADR is locked at v1.0. Revisiting requires **SemVer-major (v2.0)**. Specific triggers per the criteria above (any ONE is sufficient): 1. Second production consumer with materially different surface needs. 2. Functional grouping the codebase asks for. With sweep + TextTransform shipped at v0.47, the `attacks/` grouping (adversarial + preprocessing under one namespace) becomes the most natural candidate; nothing has demanded it as of v0.48, so the trigger stays cold pending consumer feedback in the v1.x cycle. 3. ≥2 independent discoverability complaints from real users. ## Asymmetric module promotion (v0.49.0 amendment) Per v0.49.0 (R4-A in `~/.claude/plans/what-git-issues-are-bright-torvalds.md`), this ADR is amended with one concrete sub-rule for private-to-public module promotion: **Collection-of-types private modules MAY be promoted to plural-public. Single-function private modules SHOULD stay underscore.** The asymmetry is justified by export-count + user-value: - A private module that holds **multiple public types** (e.g., `_scorecard.py` at v0.46–v0.48 held `Scorecard`, `MetricSpec`, `MetricResult`, `scorecard`) earns its public module path: power users can write `from eval_toolkit.scorecards import Scorecard` for type annotations, sphinx autodoc, and namespace hygiene. - A private module that holds **one public function only** (e.g., `_sweep.py` holds just the `sweep()` callable) does not benefit from a submodule path: nothing to extract that isn't already accessible via top-level. Promotion would add forever-Tier-2 contract surface for zero user value. Concretely at v0.49.0: - `_scorecard.py` → `scorecards.py` (PROMOTED — plural noun consistent with `metrics`, `loaders`, `protocols` collection-modules). - `_sweep.py` → STAYS as `_sweep.py` (1 export; underscore stays). The naming convention (plural noun for collection-of-types modules) is documented separately in [ADR 0004 — Naming conventions](0004-naming-conventions.md). ## References - v1.0 plan: `~/.claude/plans/evaluate-all-the-work-twinkly-kite.md` (Decision 2 + audit revision via Round 5 module-size ground-truth). - Round 5 audit ledger: [`audit_findings.md`](../audit_findings.md) — Round 5 module-size verification. - [ADR 0002 — scorecard as primary metric surface](0002-scorecard-as-primary-metric-surface.md) — companion ADR documenting the v0.46 surface design within the flat-module convention. - [ADR 0003 — stability contract + Gate 3 methodology](0003-stability-contract-and-gate3-methodology.md) — defines the tiered v1.0 stability commitment that this ADR's flat-module choice operates inside.