ADR 0004: Naming conventions

Status: Accepted at v0.49.0 (the final pre-v1.0 minor) — these conventions are the v1.0 contract for naming.

Date: 2026-05-23

Authors: brandon-behring

Context: Codified out of an audit + industry-research pass run during ~/.claude/plans/what-git-issues-are-bright-torvalds.md planning.

Context

At v0.48.0, eval-toolkit's naming was 95–99% consistent across the public surface. That consistency was deliberate, but the conventions behind it were implicit rather than documented. v1.0 locks the Tier-1 API per ADR 0003, so any inconsistency that ships at v1.0 persists until v2.0. This ADR documents what is already true and closes the small remaining gaps flagged by the v0.49.0 audit.

The audit also surfaced a v1.0-critical RNG-parameter inconsistency (seed in ~15 functions, random_state in stacking.py) that requires adoption of Scientific Python SPEC 7. That adoption ships in v0.50.0 as a focused release; this ADR documents the convention now so it locks the rule going forward.

Decisions

D1 — Module naming

Modules in src/eval_toolkit/*.py are flat per ADR 0001. Within that constraint:

  • Plural noun for collection-of-types modules — modules holding multiple related types of the same kind: metrics, loaders, protocols, losses, probes, splits, paths, seeds, thresholds, artifacts, claims, embeddings, scorecards (new at v0.49.0; was _scorecard).

  • Singular noun for domain-concept modules — modules holding a single coherent concept: harness, bootstrap, manifest, calibration, leakage, analysis, provenance, evidence, stacking, text_dedup.

  • Gerund for process-domain modules — modules describing an operation: preprocessing.

  • Private modules carry a leading underscore: _parallel.py, _deprecated.py, _version.py, _rng.py, _sweep.py. These are not in _EXPORTS; their public symbols (if any) are accessed only via top-level from eval_toolkit import X.

The asymmetric-promotion sub-rule from ADR 0001 controls when a private module should be promoted to public: collection-of-types MAY promote; single-function-only SHOULD stay underscore.

D2 — Class naming

Classes follow PascalCase per PEP 8. Within that, eval-toolkit uses domain-specific suffixes that map to Protocol contracts:

| Suffix | Domain | Examples | Protocol |
| --- | --- | --- | --- |
| *Selector | Threshold selection | MaxF1Selector, CISafeThresholdSelector, YoudenJSelector | ThresholdSelector |
| *Splitter | Cross-validation splits | HoldoutSplitter, StratifiedKFoldSplitter, PurgedKFoldSplitter | Splitter |
| *Check | Leakage detection | ExactDuplicateCheck, TokenizationLeakageCheck, LabelConflictCheck | LeakageCheck |
| *Loader | Dataset loading | DataFrameLoader, HFDatasetsLoader, ParquetGlobLoader | DatasetLoader |
| *Reader | Prediction artifact reading | CsvPredictionReader, JsonlPredictionReader | PredictionReader |
| *Variant | Preprocessing variant | DelimitVariant, DatamarkVariant, EncodeVariant | (functional API) |
| *Strategy | Dedup similarity backend | TfidfCosineStrategy, EmbeddingCosineStrategy, MinHashLSHStrategy | SimilarityStrategy |
| *Injection | Adversarial char-injection attack | ZeroWidthSpaceInjection, BidiRTLInjection, TokenSplittingInjection, UnicodeNormalizationInjection, CaseInjection (v0.49 renames) | TextTransform |
| *Substitution | Adversarial char-substitution attack | HomoglyphSubstitution, SynonymSubstitution, DiacriticInjection (historical: DiacriticInjection is a substitution by mechanism; the name preserves continuity with the v0.43 release) | TextTransform |

Result/output dataclasses use the suffix *Result, *CI, *Estimate, or *Report where the type genuinely is a result object. The exceptions (Scorecard, WilsonInterval) are intentional: Scorecard is the named domain concept, not a generic result, and WilsonInterval is the math-term name. Document any such exception in the class docstring.

Config/metadata dataclasses use *Spec, *Metadata, or *Manifest per the domain.
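As a rough illustration of how a suffix maps to a Protocol contract, the sketch below defines a stand-in ThresholdSelector Protocol and a class that satisfies it. The `select` method signature and the `MidpointSelector` class are assumptions made for this example; the real Protocol's method shape is defined in ADR 0003, not here.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence, runtime_checkable


@runtime_checkable
class ThresholdSelector(Protocol):
    """Stand-in Protocol; the `select` signature is an assumption."""

    def select(self, y_true: Sequence[int], y_score: Sequence[float]) -> float:
        ...


@dataclass(frozen=True)
class MidpointSelector:
    """Hypothetical selector: midpoint between the two class means.

    The *Selector suffix signals that the class is meant to satisfy
    the ThresholdSelector contract.
    """

    def select(self, y_true: Sequence[int], y_score: Sequence[float]) -> float:
        pos = [s for t, s in zip(y_true, y_score) if t == 1]
        neg = [s for t, s in zip(y_true, y_score) if t == 0]
        return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


selector = MidpointSelector()
threshold = selector.select([0, 0, 1, 1], [0.1, 0.3, 0.7, 0.9])
# Structural check: the class never inherits from the Protocol.
print(isinstance(selector, ThresholdSelector), round(threshold, 6))
```

The class satisfies the contract structurally, so the suffix is a reading aid for humans rather than something the type system depends on.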

D3 — Function naming

Functions use snake_case per PEP 8. Prefer a verb prefix, for factories and other operations alike, when the name is action-shaped:

| Prefix | Use | Examples |
| --- | --- | --- |
| make_* | Construct an object | make_minilm_embedder, make_palette, make_run_dir, make_manifest (v0.49 rename) |
| fit_* | Fit a model/calibrator | fit_temperature, fit_platt_binary, fit_beta_binary, fit_isotonic_binary, fit_operating_points |
| evaluate_* | Run an evaluation harness | evaluate, evaluate_folded, evaluate_claims |
| plot_* | Render a figure | plot_pr_curve, plot_roc_curve, plot_confusion_matrix_grid |
| write_* | Serialize to disk | write_manifest, write_run_result, write_json_strict |
| validate_* | Check well-formedness | validate_manifest, validate_payload, validate_source_roles |
| bootstrap_* | Resample-based CI | bootstrap_ci, bootstrap_metric_from_predictions |
| paired_* | Paired-bootstrap difference | paired_bootstrap_diff, paired_bootstrap_ece_diff, paired_mde |
| load_* | I/O read | load_prediction_arrays |
| compute_* | Pure derivation | compute_label_overlap, compute_file_hash |

Noun-form is OK for scalar metric functions per ADR 0002: pr_auc, roc_auc, brier_score, recall_at_fpr, wilson_interval. These are sklearn-aligned and would be awkward as compute_pr_auc(...).
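The prefix table and the noun-form allowance can be read as a small lookup. The sketch below is one hedged way to express that; `classify_function_name` is a hypothetical helper, not part of eval-toolkit, and it treats the table as a list of known prefixes rather than an exhaustive rule (names outside it are reported as "unlisted", not as violations).

```python
# Prefixes copied from the D3 table; noun-form names from the ADR 0002 note.
VERB_PREFIXES = (
    "make_", "fit_", "evaluate_", "plot_", "write_",
    "validate_", "bootstrap_", "paired_", "load_", "compute_",
)
NOUN_FORM_METRICS = frozenset(
    {"pr_auc", "roc_auc", "brier_score", "recall_at_fpr", "wilson_interval"}
)


def classify_function_name(name: str) -> str:
    """Return the matching prefix family, or a fallback label."""
    if name in NOUN_FORM_METRICS:
        return "noun-form metric"
    for prefix in VERB_PREFIXES:
        # A bare verb such as `evaluate` belongs to its own family.
        if name.startswith(prefix) or name == prefix.rstrip("_"):
            return prefix + "*"
    return "unlisted"


for candidate in ("make_palette", "evaluate", "pr_auc", "get_palette"):
    print(candidate, "->", classify_function_name(candidate))
```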

D4 — Parameter naming (the canonical list)

Locked at v1.0 — these names mean these things, everywhere:

| Parameter | Meaning |
| --- | --- |
| y_true | Ground-truth labels (binary, shape (n,)) |
| y_score | Continuous score / probability (shape (n,)) |
| y_pred | Discrete prediction (when threshold-dependent) |
| n_resamples | Bootstrap iteration count |
| confidence | Two-sided confidence level (0.95 default) |
| n_bins | Binning count for calibration / ECE |
| n_jobs | Parallelism (joblib convention; sklearn-aligned) |
| ax | Matplotlib axis (matplotlib convention) |
| metric | Callable (y_true, y_score) -> float |
| rng | RNG argument per SPEC 7; canonical convention (adopted v0.50.0). Accepts an int, np.random.Generator, BitGenerator, SeedSequence, or None. |
| seed | Legacy name used through v0.49; replaced by rng at v0.50.0 across ~22 Tier-1 sites. Exceptions where seed is retained: seeds.set_global_seeds(seed: int) (global-state setter; SPEC 7 doesn’t apply), adversarial dataclass fields (use Python random.Random(seed); not NumPy-RNG), Splitter dataclass class-fields (configuration storage, not user-facing RNG parameter), loaders.py YAML config key. |

Future functions MUST use these names. A PR that introduces labels= (instead of y_true=), scores= (instead of y_score=), alpha= (instead of confidence=), or any deviation must justify it in the PR description or rename to the canonical name.
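Since there is no automated check for this rule, a reviewer could approximate it with a signature scan. The sketch below uses only `inspect` from the standard library; the helper name `find_noncanonical_params` and the two sample functions are hypothetical, and the mapping covers only the three deviations named in the paragraph above.

```python
import inspect

# Non-canonical name -> canonical name, taken from the paragraph above.
NONCANONICAL_TO_CANONICAL = {
    "labels": "y_true",
    "scores": "y_score",
    "alpha": "confidence",
}


def find_noncanonical_params(func) -> dict:
    """Map each non-canonical parameter of `func` to its canonical name."""
    params = inspect.signature(func).parameters
    return {
        name: NONCANONICAL_TO_CANONICAL[name]
        for name in params
        if name in NONCANONICAL_TO_CANONICAL
    }


def hypothetical_ci(labels, scores, alpha=0.05, n_resamples=1000):
    """Illustrative signature that would need justification or renaming."""


def canonical_ci(y_true, y_score, confidence=0.95, n_resamples=1000):
    """Illustrative signature that follows the canonical list."""


print(find_noncanonical_params(hypothetical_ci))
print(find_noncanonical_params(canonical_ci))
```

A scan like this only flags names. Whether a flagged deviation is justified is still a review decision.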

Known deviations (recorded 2026-06-12, v1.12.0): cluster_bootstrap_ci(statistic=) (shipped v1.7.0) uses statistic rather than the canonical metric for the (y_true, y_score) -> float callable role; its 7 siblings in bootstrap.py use metric. The signature is Tier-1-frozen per ADR 0003 (it is snapshot-pinned, and a deprecation alias would itself mutate the Tier-1 signature), so the rename is deferred to the v2.0 breaking-changes queue (#116), which also carries the related per_stratum_metric naming discussion. The eda random_state/n_bootstrap deviations flagged by the same audit were closed at v1.12.0 via the DEPRECATION.md one-time-exception path (#100).

D5 — Constants

UPPER_SNAKE_CASE per PEP 8. Tier-1 constants in _EXPORTS include DEFAULT_SEED, DEFAULT_N_RESAMPLES, DEFAULT_CONFIDENCE, DEFAULT_METHOD, MANIFEST_SCHEMA_VERSION, CORE_TECHNIQUES, ADVANCED_TECHNIQUES, ALL_TECHNIQUES.

D6 — Protocol naming

Protocols follow PascalCase per PEP 8 and are named semantically per ADR 0003 §1. There is no forced uniform suffix: the contract is the method shape, not the name shape. The 10 strict Tier-2 Protocols plus 1 opt-in Protocol are named for their roles: Scorer, LeakageCheck, Splitter, ThresholdSelector, DatasetLoader, MetricSpec, TextTransform, MetaLearner, Probe, SimilarityStrategy, Versioned. (Note: SimilarityStrategy was promoted from “pre-v0.7 internal” to strict Tier-2 at v1.0.2 per RC2 reconciliation; #76.)

D7 — TypeVars

Internal TypeVars use a leading underscore per Google Python Style Guide §3.19.10: _T = TypeVar("_T"). Public, constrained TypeVars without underscore are allowed when explicitly part of an exported API.
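A minimal sketch of both cases follows. The helper functions and the public `Number` TypeVar are hypothetical names chosen for illustration; none of them are eval-toolkit symbols.

```python
from typing import Sequence, TypeVar

# Internal TypeVar: leading underscore, never exported.
_T = TypeVar("_T")

# Public, constrained TypeVar: no underscore, intended for export.
Number = TypeVar("Number", int, float)


def first_or_default(items: Sequence[_T], default: _T) -> _T:
    """Return the first item, or `default` when `items` is empty."""
    return items[0] if items else default


def clamp(value: Number, low: Number, high: Number) -> Number:
    """Restrict `value` to the closed interval [low, high]."""
    return max(low, min(value, high))


print(first_or_default([3, 4], 0), clamp(5, 0, 3))
```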

D8 — Fitted estimator attributes (sklearn alignment)

Estimator-style classes (fit/predict pattern) that store learned-from-data attributes use trailing underscore per scikit-learn convention: coef_, classes_, n_features_in_, feature_importances_. These attributes MUST NOT be set in __init__ — set them only in fit(). (See scikit-learn Developing estimators.)

Frozen reference-impl dataclasses (@dataclass(frozen=True, slots=True)) are exempt — they hold config, not fitted state.

stacking.LogisticStacker is the current canonical example.
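The rule can be shown without reproducing LogisticStacker. The sketch below uses a hypothetical `MeanScoreEstimator` (not an eval-toolkit class) whose learned attributes exist only after `fit()`, which is what lets callers detect an unfitted instance.

```python
class MeanScoreEstimator:
    """Hypothetical estimator: thresholds scores at their fitted mean."""

    def __init__(self, shrinkage: float = 0.0) -> None:
        # Hyperparameters only; no trailing-underscore attributes here.
        self.shrinkage = shrinkage

    def fit(self, y_score):
        # Learned-from-data state carries a trailing underscore.
        self.n_samples_ = len(y_score)
        self.mean_ = (1.0 - self.shrinkage) * sum(y_score) / self.n_samples_
        return self

    def predict(self, y_score):
        if not hasattr(self, "mean_"):
            raise RuntimeError("call fit() before predict()")
        return [int(s >= self.mean_) for s in y_score]


estimator = MeanScoreEstimator()
print(hasattr(estimator, "mean_"))  # unfitted: attribute is absent
estimator.fit([0.2, 0.4, 0.9])
y_pred = estimator.predict([0.1, 0.6])
print(hasattr(estimator, "mean_"), y_pred)
```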

D9 — Docstring style

NumPy docstring format per STYLE.md §12. Prose wraps at 75 cols (numpydoc convention); doctest code blocks follow the 100-col Black rule. The 75-col rule keeps docstrings readable in a terminal help() call; the 100-col rule keeps code in docstrings readable in a normal editor.
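For orientation, here is a sketch of a NumPy-format docstring that stays within 75 columns and uses the canonical parameter names from D4. `toy_brier_score` is a stand-in written for this example; it is not the toolkit's brier_score implementation.

```python
def toy_brier_score(y_true, y_score):
    """Mean squared gap between scores and binary labels.

    Parameters
    ----------
    y_true : sequence of int
        Ground-truth binary labels, shape (n,).
    y_score : sequence of float
        Predicted probabilities, shape (n,).

    Returns
    -------
    float
        Mean of ``(y_score - y_true) ** 2`` over all n samples.

    Examples
    --------
    >>> toy_brier_score([0, 1], [0.0, 1.0])
    0.0
    """
    return sum((s - t) ** 2 for t, s in zip(y_true, y_score)) / len(y_true)


# Longest docstring line, indentation included.
print(max(len(line) for line in toy_brier_score.__doc__.splitlines()))
```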

D10 — Test naming

tests/test_<module>.py mirrors src/eval_toolkit/<module>.py. Test functions are test_<thing_under_test>_<scenario>. No class-based test grouping unless fixtures truly demand it (rare in this codebase).
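The two patterns can be written as small string helpers. Both function names below, and the `rejects_empty_input` scenario, are hypothetical. The sketch makes no claim about how tests for underscore-prefixed private modules are named.

```python
from pathlib import PurePosixPath


def expected_test_path(module_path: str) -> str:
    """Map src/eval_toolkit/<module>.py to tests/test_<module>.py."""
    stem = PurePosixPath(module_path).stem
    return f"tests/test_{stem}.py"


def build_test_name(thing_under_test: str, scenario: str) -> str:
    """Compose test_<thing_under_test>_<scenario>."""
    return f"test_{thing_under_test}_{scenario}"


print(expected_test_path("src/eval_toolkit/bootstrap.py"))
print(build_test_name("bootstrap_ci", "rejects_empty_input"))
```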

Industry alignment

The conventions above were verified against canonical sources during the v0.49.0 audit:

Eval-toolkit deviates from industry conventions in four places, all documented. The first three are intentional; the fourth is a consequence of the Tier-1 freeze:

  1. Unicode math identifiers (π, θ, μ, σ, α, β) are permitted in math kernels with a required English-comment alias, per STYLE.md §3 and §16. PEP 8 mandates ASCII-only identifiers (strictly, for the standard library); eval-toolkit’s math-paper-fidelity domain justifies the exception.

  2. set_global_seeds(seed: int) keeps seed even after the v0.50.0 SPEC 7 adoption — this is a global-state setter, not a per-function RNG argument, so SPEC 7 doesn’t apply.

  3. Adversarial dataclass fields keep seed: int = 42 because they use Python’s stdlib random.Random(seed), not NumPy’s Generator. SPEC 7’s typing (RNGLike = np.random.Generator | ...) is strictly NumPy-scoped.

  4. cluster_bootstrap_ci(statistic=) keeps the scipy-flavored statistic name (scipy.stats.bootstrap uses it) despite §D4 reserving metric for this callable role — not by intent but by Tier-1 freeze; see the Known-deviations note under D4 and the v2.0 queue (#116).
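Deviation 3 can be made concrete with a sketch of a seed-carrying dataclass that draws from the standard-library `random.Random` rather than a NumPy Generator. `ExampleCaseFlip` and its `rate` field are hypothetical; the real adversarial classes may differ in fields and behavior.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ExampleCaseFlip:
    """Hypothetical transform; not an eval-toolkit class."""

    rate: float = 0.5
    seed: int = 42  # stdlib-RNG seed, so the name `seed` is retained

    def __call__(self, text: str) -> str:
        # A fresh stdlib RNG per call keeps the output reproducible.
        stdlib_rng = random.Random(self.seed)
        return "".join(
            ch.swapcase() if stdlib_rng.random() < self.rate else ch
            for ch in text
        )


flip = ExampleCaseFlip()
# Same seed, same input: the two outputs match.
print(flip("hello world") == flip("hello world"))
```

Because the field configures a stdlib generator, SPEC 7's NumPy-oriented `rng` argument would not describe it accurately, which is the reasoning behind keeping `seed` here.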

Forward enforcement

Every PR that adds a new public symbol MUST satisfy these conventions or document the exception in the PR description. There is no automated lint enforcement yet — that is a v1.x candidate (deferred N8 in the v0.49.0 plan; tracked as a future-improvement issue once the patterns have settled across a couple of v1.x minors).

References