ADR 0001: Flat single-file modules through v1.x#

Status: Accepted Date: 2026-05-21 (drafted at v0.46-prep; finalized at v0.48) Deciders: Brandon Behring (author), Round 5 / Round 6 / Round 7 audits (Codex + Gemini) Supersedes: N/A. Superseded by: N/A.

Context#

eval-toolkit organizes its public API as flat single-file modules under src/eval_toolkit/. Every concern — metrics, bootstrap, calibration, harness, leakage, splits, thresholds, stacking, scorecard, etc. — lives in exactly one .py file, often several hundred to a couple thousand lines.

The v1.0 plan ([[evaluate-all-the-work-twinkly-kite]]) inherits this layout from v0.x. Two pressure tests against the flat-module convention surfaced during planning + Round 5 audit:

Module size. A ground-truth check at the planning round measured 9 modules exceeding 800 LOC; at v0.48 the count is similar (metrics.py 1863, bootstrap.py 1849, calibration.py 1515, harness.py 1449, text_dedup.py 1403, plotting.py 1351, leakage.py 1215, loaders.py 1080, thresholds.py 953). The earlier “800 LOC trigger” criterion was already violated by current code at the planning round and remains so at v0.48 — nine production modules sit comfortably above the line without becoming discoverability problems.
Functional grouping. v0.46 introduced _scorecard.py and metric_specs.py; v0.47 added a top-level _sweep.py + TextTransform Protocol that unifies attacks and preprocessing under one strategy surface (adversarial.py ships 12 character-injection techniques via ALL_TECHNIQUES). The question naturally arises: should “metrics” become a subpackage with metrics/scorecard.py, metrics/specs.py, metrics/calibration.py, etc.? Same for attacks/ (adversarial + preprocessing) or bootstrap/.

Either move (size-driven or grouping-driven) would be breaking at v1.0 because consumer imports follow the flat module structure (from eval_toolkit.metrics import pr_auc, not from eval_toolkit.metrics.calibration_metrics import pr_auc).

Decision#

Stay flat through v1.x. v1.0 commits to the existing flat-module layout; subpackage restructuring is deferred until at least one of three concrete triggers (below) fires, at which point v2.0 is the appropriate boundary.

Specifically:

The current src/eval_toolkit/*.py file structure is the v1.0 contract for internal-API module paths. eval_toolkit.metrics, eval_toolkit.bootstrap, eval_toolkit.calibration, etc., remain valid submodule paths through the v1.x release line.
The strict v1.0 contract is still scoped to top-level __all__ + Tier-2 Protocols (per ADR 0003). Submodule public symbols carry an additive-only commitment — they may gain functionality in v1.x minors but the existing path remains importable.
Internal helpers (anything not in any module’s __all__) are free to move in v1.x minors.

Trigger criteria for a v2.0 subpackage restructure#

The flat-module convention holds until ANY of the following becomes true (none of which currently apply):

A. Second production consumer with materially different surface needs#

The single-consumer / breaking-OK rationale that made the v0.46 → v1.0 breaking sequence cheap stops holding once a second downstream consumer adopts eval-toolkit with materially different access patterns. If the second consumer would benefit from a subpackage layout (e.g., importing from eval_toolkit.attacks rather than eval_toolkit.adversarial + eval_toolkit.preprocessing separately), the v2.0 prep cycle is the appropriate time to evaluate.

B. Clear functional grouping that the codebase asks for#

Examples that could surface during v1.x development:

attacks/ — adversarial + preprocessing (defense) modules collapse into a single attack-surface namespace as sweep() + TextTransform become the dominant interfaces.
calibration/ — the 4-binary-adapter family (fit_temperature_binary, fit_isotonic_binary, fit_platt_binary, fit_beta_binary), the underlying calibrator fitters, and the calibration-error metrics (ECE family) split across metrics.py + calibration.py could naturally group.
bootstrap/ — bootstrap_ci, paired_bootstrap_*, block_bootstrap_on_folds, DeLongResult, BootstrapCI, PairedBootstrapCI, MDE estimation, and CV-CLT could group around the inference primitive theme.

None of these have been demanded by the consumer or by audit feedback through v0.47.0 (Round 5 + Round 6 + Round 7 audits did not flag layout as a concern). The trigger is “asks for it,” not “could conceivably be grouped.”

C. Discoverability complaint from real users#

If multiple users file issues like “I can’t find where X is” or “the module name doesn’t match what it does,” subpackage navigation may help. The trigger is two or more such issues from independent reporters — not internal tooling concerns or single-author preference.

Not a trigger: per-module line count#

The original v1.0 plan draft cited a “module crosses ~800 LOC” trigger. Verification at planning time found 9 modules already exceed that — the trigger criterion was already violated, but nobody (consumer or audit) flagged module size as a problem.

The pattern that emerged: eval-toolkit’s large modules are cohesive. metrics.py is large because it contains every binary-classification metric primitive; that’s the right home for them. bootstrap.py is large because it contains every bootstrap-CI primitive; same story. text_dedup.py is large because it contains the full similarity-strategy + leakage-check decomposition for one logical problem.

Splitting a cohesive module on size alone produces arbitrary boundaries that don’t help discoverability. v1.0 explicitly rejects size as a splitting trigger.

Consequences#

Positive:

One module path per concern. from eval_toolkit.metrics import pr_auc is the canonical scalar-metric import; from eval_toolkit.bootstrap import bootstrap_ci is the canonical inference import. Consumers don’t need to learn a deeper module taxonomy.
v1.0 → v1.x can grow without coordinating subpackage moves.
Test files mirror the module structure 1:1 (tests/test_metrics.py, etc.), which is easy to reason about.

Negative:

metrics.py at ~1850 lines is genuinely cumbersome to navigate in an IDE. The mitigation is __all__ (already in place) + the metric_specs namespace (v0.46+) which lets consumers reach individual metrics through a smaller surface without the IDE-navigation cost.
A future second consumer may want a different organizational vocabulary; v2.0 is the appropriate boundary.

Alternatives considered#

Subpackage restructure at v1.0#

Split the largest modules into subpackages (metrics/, bootstrap/, calibration/, harness/, text_dedup/). Rejected because:

v1.0 commits to API stability; a flat→subpackage shift is exactly the kind of breaking change v1.0 should NOT introduce (a v0.46-style soft-deprecation shim would need to re-export from old paths for multiple releases).
The single consumer reports no discoverability problems with the flat layout.
Each subpackage would require its own __init__.py re-export choices, multiplying the public-API surface to maintain.

Defer the layout decision past v1.0 without ADR#

Make no commitment in either direction; revisit ad-hoc. Rejected because v1.0 IS the API stability commitment — the subpackage question must be explicitly resolved one way or the other. This ADR is the resolution.

Hybrid: subpackage some, keep others flat#

E.g., split metrics/ but keep bootstrap.py flat. Rejected because the hybrid pattern is harder to teach and creates inconsistent expectations: new users would have to look up every module to know whether it’s a single file or a directory.

Trigger to revisit#

This ADR is locked at v1.0. Revisiting requires SemVer-major (v2.0). Specific triggers per the criteria above (any ONE is sufficient):

Second production consumer with materially different surface needs.
Functional grouping the codebase asks for. With sweep + TextTransform shipped at v0.47, the attacks/ grouping (adversarial + preprocessing under one namespace) becomes the most natural candidate; nothing has demanded it as of v0.48, so the trigger stays cold pending consumer feedback in the v1.x cycle.
≥2 independent discoverability complaints from real users.

Asymmetric module promotion (v0.49.0 amendment)#

Per v0.49.0 (R4-A in ~/.claude/plans/what-git-issues-are-bright-torvalds.md), this ADR is amended with one concrete sub-rule for private-to-public module promotion:

Collection-of-types private modules MAY be promoted to plural-public. Single-function private modules SHOULD stay underscore.

The asymmetry is justified by export-count + user-value:

A private module that holds multiple public types (e.g., _scorecard.py at v0.46–v0.48 held Scorecard, MetricSpec, MetricResult, scorecard) earns its public module path: power users can write from eval_toolkit.scorecards import Scorecard for type annotations, sphinx autodoc, and namespace hygiene.
A private module that holds one public function only (e.g., _sweep.py holds just the sweep() callable) does not benefit from a submodule path: nothing to extract that isn’t already accessible via top-level. Promotion would add forever-Tier-2 contract surface for zero user value.

Concretely at v0.49.0:

_scorecard.py → scorecards.py (PROMOTED — plural noun consistent with metrics, loaders, protocols collection-modules).
_sweep.py → STAYS as _sweep.py (1 export; underscore stays).

The naming convention (plural noun for collection-of-types modules) is documented separately in ADR 0004 — Naming conventions.

References#

v1.0 plan: ~/.claude/plans/evaluate-all-the-work-twinkly-kite.md (Decision 2 + audit revision via Round 5 module-size ground-truth).
Round 5 audit ledger: audit_findings.md — Round 5 module-size verification.
ADR 0002 — scorecard as primary metric surface — companion ADR documenting the v0.46 surface design within the flat-module convention.
ADR 0003 — stability contract + Gate 3 methodology — defines the tiered v1.0 stability commitment that this ADR’s flat-module choice operates inside.