# ADR 0004: Naming conventions

**Status:** Accepted at v0.49.0 (the final pre-v1.0 minor); these conventions are the v1.0 contract for naming.

**Date:** 2026-05-23

**Authors:** brandon-behring

**Context:** Codified out of an audit + industry-research pass run during `~/.claude/plans/what-git-issues-are-bright-torvalds.md` planning.

## Context

eval-toolkit at v0.48.0 had 95–99% consistent naming across the public surface (a real success, not an accident), but the conventions were implicit rather than documented. v1.0 locks the Tier-1 API per [ADR 0003](0003-stability-contract-and-gate3-methodology.md); any inconsistency that ships at v1.0 lives forever (until v2.0). This ADR both documents what's already true AND closes the small remaining gaps flagged by the v0.49.0 audit.

The audit also surfaced a v1.0-critical RNG-parameter inconsistency (`seed` in ~15 functions, `random_state` in `stacking.py`) that requires adoption of [Scientific Python SPEC 7](https://scientific-python.org/specs/spec-0007/). That adoption ships in v0.50.0 as a focused release; this ADR documents the convention now so it locks the rule going forward.

## Decisions

### D1: Module naming

Modules in `src/eval_toolkit/*.py` are flat per [ADR 0001](0001-flat-module-layout.md). Within that constraint:

- **Plural noun for collection-of-types modules**: modules holding multiple related types of the same kind: `metrics`, `loaders`, `protocols`, `losses`, `probes`, `splits`, `paths`, `seeds`, `thresholds`, `artifacts`, `claims`, `embeddings`, `scorecards` (new at v0.49.0; was `_scorecard`).
- **Singular noun for domain-concept modules**: modules holding a single coherent concept: `harness`, `bootstrap`, `manifest`, `calibration`, `leakage`, `analysis`, `provenance`, `evidence`, `stacking`, `text_dedup`.
- **Gerund for process-domain modules**: modules describing an operation: `preprocessing`.
- **Private modules** carry a leading underscore: `_parallel.py`, `_deprecated.py`, `_version.py`, `_rng.py`, `_sweep.py`. These are not in `_EXPORTS`; their public symbols (if any) are accessed only via top-level `from eval_toolkit import X`.

The asymmetric-promotion sub-rule from [ADR 0001](0001-flat-module-layout.md) controls when a private module should be promoted to public: collection-of-types MAY promote; single-function-only SHOULD stay underscore.

### D2: Class naming

Classes follow `PascalCase` per PEP 8. Within that, eval-toolkit uses **domain-specific suffixes** that map to Protocol contracts:

| Suffix | Domain | Examples | Protocol |
|---|---|---|---|
| `*Selector` | Threshold selection | `MaxF1Selector`, `CISafeThresholdSelector`, `YoudenJSelector` | `ThresholdSelector` |
| `*Splitter` | Cross-validation splits | `HoldoutSplitter`, `StratifiedKFoldSplitter`, `PurgedKFoldSplitter` | `Splitter` |
| `*Check` | Leakage detection | `ExactDuplicateCheck`, `TokenizationLeakageCheck`, `LabelConflictCheck` | `LeakageCheck` |
| `*Loader` | Dataset loading | `DataFrameLoader`, `HFDatasetsLoader`, `ParquetGlobLoader` | `DatasetLoader` |
| `*Reader` | Prediction artifact reading | `CsvPredictionReader`, `JsonlPredictionReader` | `PredictionReader` |
| `*Variant` | Preprocessing variant | `DelimitVariant`, `DatamarkVariant`, `EncodeVariant` | (functional API) |
| `*Strategy` | Dedup similarity backend | `TfidfCosineStrategy`, `EmbeddingCosineStrategy`, `MinHashLSHStrategy` | `SimilarityStrategy` |
| `*Injection` | Adversarial char-injection attack | `ZeroWidthSpaceInjection`, `BidiRTLInjection`, `TokenSplittingInjection`, `UnicodeNormalizationInjection`, `CaseInjection` (v0.49 renames) | `TextTransform` |
| `*Substitution` | Adversarial char-substitution attack | `HomoglyphSubstitution`, `SynonymSubstitution`, `DiacriticInjection` *(historical: `DiacriticInjection` is a substitution by mechanism; the name preserves continuity with the v0.43 release)* | `TextTransform` |

**Result/output dataclasses** use the suffix `*Result`, `*CI`, `*Estimate`, or `*Report` where the type genuinely is a result object; exception cases (`Scorecard`, `WilsonInterval`) are intentional: `Scorecard` is the named domain concept, not a generic result, and `WilsonInterval` is the math-term name. Document exceptions in the class docstring.

**Config/metadata** dataclasses use `*Spec`, `*Metadata`, or `*Manifest` per the domain.

### D3: Function naming

Functions use `snake_case` per PEP 8. Prefer **verb-prefix for factories** when the name is action-shaped:

| Prefix | Use | Examples |
|---|---|---|
| `make_*` | Construct an object | `make_minilm_embedder`, `make_palette`, `make_run_dir`, `make_manifest` (v0.49 rename) |
| `fit_*` | Fit a model/calibrator | `fit_temperature`, `fit_platt_binary`, `fit_beta_binary`, `fit_isotonic_binary`, `fit_operating_points` |
| `evaluate_*` | Run an evaluation harness | `evaluate`, `evaluate_folded`, `evaluate_claims` |
| `plot_*` | Render a figure | `plot_pr_curve`, `plot_roc_curve`, `plot_confusion_matrix_grid` |
| `write_*` | Serialize to disk | `write_manifest`, `write_run_result`, `write_json_strict` |
| `validate_*` | Check well-formedness | `validate_manifest`, `validate_payload`, `validate_source_roles` |
| `bootstrap_*` | Resample-based CI | `bootstrap_ci`, `bootstrap_metric_from_predictions` |
| `paired_*` | Paired-bootstrap difference | `paired_bootstrap_diff`, `paired_bootstrap_ece_diff`, `paired_mde` |
| `load_*` | I/O read | `load_prediction_arrays` |
| `compute_*` | Pure derivation | `compute_label_overlap`, `compute_file_hash` |

**Noun-form is OK** for scalar metric functions per [ADR 0002](0002-scorecard-as-primary-metric-surface.md): `pr_auc`, `roc_auc`, `brier_score`, `recall_at_fpr`, `wilson_interval`. These are sklearn-aligned and would be awkward as `compute_pr_auc(...)`.
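
The D2 suffix-to-Protocol mapping and the D3 prefix rules can be sketched together in a few lines. This is an illustration only: the `select` method shape on `ThresholdSelector` is an assumption (this ADR does not specify it), and `MidpointSelector`, `make_midpoint_selector`, and `accuracy_at_threshold` are hypothetical names that do not exist in eval-toolkit.

```python
from collections.abc import Sequence
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@runtime_checkable
class ThresholdSelector(Protocol):
    """Assumed method shape, for illustration; the real Protocol may differ."""

    def select(self, y_true: Sequence[int], y_score: Sequence[float]) -> float:
        ...


@dataclass(frozen=True)
class MidpointSelector:
    """Hypothetical `*Selector`: midpoint between the two class-mean scores."""

    def select(self, y_true: Sequence[int], y_score: Sequence[float]) -> float:
        pos = [s for t, s in zip(y_true, y_score) if t == 1]
        neg = [s for t, s in zip(y_true, y_score) if t == 0]
        return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0


def make_midpoint_selector() -> MidpointSelector:
    """Verb-prefix factory (`make_*`): constructs an object."""
    return MidpointSelector()


def accuracy_at_threshold(
    y_true: Sequence[int], y_score: Sequence[float], threshold: float
) -> float:
    """Noun-form scalar metric (hypothetical), using the D4 names `y_true` and `y_score`."""
    hits = sum(int(s >= threshold) == t for t, s in zip(y_true, y_score))
    return hits / len(y_true)


selector = make_midpoint_selector()
threshold = selector.select([0, 0, 1, 1], [0.125, 0.375, 0.625, 0.875])
# The class satisfies the Protocol through its method shape, not its name.
print(isinstance(selector, ThresholdSelector), threshold)  # True 0.5
```

The `*Selector` suffix tells a reader which Protocol to expect, while the `isinstance` check depends only on the presence of `select`, which is the sense in which the contract is the method shape.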
### D4: Parameter naming (the canonical list)

Locked at v1.0. These names mean these things, everywhere:

| Parameter | Meaning |
|---|---|
| `y_true` | Ground-truth labels (binary, shape `(n,)`) |
| `y_score` | Continuous score / probability (shape `(n,)`) |
| `y_pred` | Discrete prediction (when threshold-dependent) |
| `n_resamples` | Bootstrap iteration count |
| `confidence` | Two-sided confidence level (0.95 default) |
| `n_bins` | Binning count for calibration / ECE |
| `n_jobs` | Parallelism (joblib convention; sklearn-aligned) |
| `ax` | Matplotlib axis (matplotlib convention) |
| `metric` | Callable `(y_true, y_score) -> float` |
| `rng` | RNG argument per SPEC 7; the **canonical** convention (adopted v0.50.0). Accepts `int \| np.random.Generator \| BitGenerator \| SeedSequence \| None`. |
| `seed` | _legacy name_ used through v0.49; replaced by `rng` at v0.50.0 across ~22 Tier-1 sites. Exceptions where `seed` is retained: `seeds.set_global_seeds(seed: int)` (global-state setter; SPEC 7 doesn't apply), adversarial dataclass fields (use Python `random.Random(seed)`; not NumPy-RNG), Splitter dataclass class-fields (configuration storage, not user-facing RNG parameter), `loaders.py` YAML config key. |

**Future functions MUST use these names.** A PR that introduces `labels=` (instead of `y_true=`), `scores=` (instead of `y_score=`), `alpha=` (instead of `confidence=`), or any other deviation must either justify it in the PR description or rename to the canonical name.

**Known deviations (recorded 2026-06-12, v1.12.0):** `cluster_bootstrap_ci(statistic=)` (shipped v1.7.0) deviates from the canonical `metric` for the `(y_true, y_score) -> float` callable role; 7 siblings in `bootstrap.py` use `metric`. The signature is Tier-1-frozen per [ADR 0003](0003-stability-contract-and-gate3-methodology.md) (snapshot-pinned; a deprecation alias would itself mutate the Tier-1 signature), so the rename is queued for v2.0 in the [v2.0 breaking-changes queue (#116)](https://github.com/brandon-behring/eval-toolkit/issues/116), which also carries the related `per_stratum_metric` naming discussion. The eda `random_state`/`n_bootstrap` deviations flagged by the same audit were closed at v1.12.0 via the DEPRECATION.md one-time-exception path (#100).

### D5: Constants

`UPPER_SNAKE_CASE` per PEP 8. Tier-1 constants in `_EXPORTS` include `DEFAULT_SEED`, `DEFAULT_N_RESAMPLES`, `DEFAULT_CONFIDENCE`, `DEFAULT_METHOD`, `MANIFEST_SCHEMA_VERSION`, `CORE_TECHNIQUES`, `ADVANCED_TECHNIQUES`, `ALL_TECHNIQUES`.

### D6: Protocol naming

Protocols follow `PascalCase` per PEP 8 and are **named semantically** per [ADR 0003](0003-stability-contract-and-gate3-methodology.md) §1. There is no forced uniform suffix: the *contract* is the method shape, not the name shape.

The 10 strict Tier-2 Protocols plus 1 opt-in Protocol are named for their roles: `Scorer`, `LeakageCheck`, `Splitter`, `ThresholdSelector`, `DatasetLoader`, `MetricSpec`, `TextTransform`, `MetaLearner`, `Probe`, `SimilarityStrategy`, `Versioned`. (Note: `SimilarityStrategy` was promoted from "pre-v0.7 internal" to strict Tier-2 at v1.0.2 per RC2 reconciliation; #76.)

### D7: TypeVars

Internal `TypeVar`s use a leading underscore per Google Python Style Guide §3.19.10: `_T = TypeVar("_T")`. Public, constrained `TypeVar`s without an underscore are allowed when explicitly part of an exported API.

### D8: Fitted estimator attributes (sklearn alignment)

Estimator-style classes (`fit`/`predict` pattern) that store **learned-from-data attributes** use a **trailing underscore** per scikit-learn convention: `coef_`, `classes_`, `n_features_in_`, `feature_importances_`. These attributes MUST NOT be set in `__init__`; set them only in `fit()`.
(See [scikit-learn Developing estimators](https://scikit-learn.org/stable/developers/develop.html#fitted-attributes).)

Frozen reference-impl dataclasses (`@dataclass(frozen=True, slots=True)`) are exempt: they hold config, not fitted state. `stacking.LogisticStacker` is the current canonical example.

### D9: Docstring style

NumPy docstring format per STYLE.md §12. **Prose wraps at 75 cols** (numpydoc convention); doctest code blocks follow the 100-col Black rule. The 75-col rule keeps docstrings readable in a terminal `help()` call; the 100-col rule keeps code in docstrings readable in a normal editor.

### D10: Test naming

`tests/test_<module>.py` mirrors `src/eval_toolkit/<module>.py`. Test functions are `test__`. No class-based test grouping unless fixtures truly demand it (rare in this codebase).

## Industry alignment

The conventions above were verified against canonical sources during the v0.49.0 audit:

- [PEP 8: Style Guide for Python Code](https://peps.python.org/pep-0008/): module/class/function/variable/constant naming, leading-underscore privacy.
- [scikit-learn Developing estimators](https://scikit-learn.org/stable/developers/develop.html): fit/predict contract, trailing underscore for fitted attributes, parameter names `y_true` / `y_pred` / `y_score` / `n_jobs`.
- [numpydoc Format Spec](https://numpydoc.readthedocs.io/en/latest/format.html): docstring sections + 75-col prose rule.
- [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html): TypeVar leading underscore (§3.19.10) for internal types.
- [Scientific Python SPEC 7: Seeding pseudo-random number generation](https://scientific-python.org/specs/spec-0007/): `rng: RNGLike | SeedLike | None` parameter convention (adopted in v0.50.0; documented here so the rule locks now).

Eval-toolkit deviates from industry conventions in four places, all documented; the first three are intentional and the fourth is held in place by the Tier-1 freeze:

1. **Unicode math identifiers** (`π`, `θ`, `μ`, `σ`, `α`, `β`) are permitted in math kernels with a required English-comment alias, per STYLE.md §3 and §16. PEP 8 requires ASCII-only identifiers in the standard library and encourages the same policy for open-source projects with a global audience; eval-toolkit's math-paper-fidelity domain justifies the exception.
2. **`set_global_seeds(seed: int)`** keeps `seed` even after the v0.50.0 SPEC 7 adoption. This is a global-state setter, not a per-function RNG argument, so SPEC 7 doesn't apply.
3. **Adversarial dataclass fields** keep `seed: int = 42` because they use Python's stdlib `random.Random(seed)`, not NumPy's `Generator`. SPEC 7's typing (`RNGLike = np.random.Generator | ...`) is strictly NumPy-scoped.
4. **`cluster_bootstrap_ci(statistic=)`** keeps the scipy-flavored `statistic` name (`scipy.stats.bootstrap` uses it) despite §D4 reserving `metric` for this callable role, not by intent but by Tier-1 freeze; see the Known-deviations note under D4 and the v2.0 queue ([#116](https://github.com/brandon-behring/eval-toolkit/issues/116)).

## Forward enforcement

Every PR that adds a new public symbol MUST satisfy these conventions or document the exception in the PR description. There is no automated lint enforcement yet; that is a v1.x candidate (deferred N8 in the v0.49.0 plan; tracked as a future-improvement issue once the patterns have settled across a couple of v1.x minors).

## References

- [ADR 0001: Flat module layout](0001-flat-module-layout.md) (cross-references the asymmetric-promotion principle).
- [ADR 0002: Scorecard as primary metric surface](0002-scorecard-as-primary-metric-surface.md).
- [ADR 0003: Stability contract + Gate 3 methodology](0003-stability-contract-and-gate3-methodology.md) (defines the Tier-1/2/3 framework these naming rules operate inside).
- `STYLE.md` (root): contributor-facing daily reference; this ADR is the decision record.
- `~/.claude/plans/what-git-issues-are-bright-torvalds.md`: the v0.49.0 plan that produced this ADR.
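
As a rough sketch of what the lint check deferred under Forward enforcement might look like for the D4 parameter names, the following uses only the standard library's `inspect.signature`. The helper `find_naming_violations`, the `NON_CANONICAL` mapping, and the two sample functions are hypothetical; only the three name pairs that D4 calls out are covered, and nothing here is existing project tooling.

```python
import inspect
from collections.abc import Callable

# Discouraged name -> canonical D4 name (hypothetical table; only the three
# pairs named explicitly in D4 are listed).
NON_CANONICAL = {"labels": "y_true", "scores": "y_score", "alpha": "confidence"}


def find_naming_violations(func: Callable[..., object]) -> list[str]:
    """Return one message per parameter of `func` that uses a discouraged name."""
    params = inspect.signature(func).parameters
    return [
        f"{func.__name__}: rename `{name}` to `{NON_CANONICAL[name]}`"
        for name in params
        if name in NON_CANONICAL
    ]


def good_metric(y_true, y_score, confidence=0.95):
    """Sample function that follows the canonical names."""
    return 0.0


def bad_metric(labels, scores, alpha=0.05):
    """Sample function that would need a rename or a PR justification."""
    return 0.0


for message in find_naming_violations(bad_metric):
    print(message)
```

A check like this can only flag names; it cannot tell whether a documented exception applies, so a real version would presumably need an allowlist for the retained `seed` sites and the frozen `statistic` parameter.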