ADR 0004: Naming conventions#
Status: Accepted at v0.49.0 (the final pre-v1.0 minor) — these conventions are the v1.0 contract for naming.
Date: 2026-05-23
Authors: brandon-behring
Context: Codified out of an audit + industry-research pass run during
~/.claude/plans/what-git-issues-are-bright-torvalds.md planning.
Context#
eval-toolkit at v0.48.0 had 95–99% consistent naming across the public surface — a real success, not an accident — but the conventions were implicit rather than documented. v1.0 locks the Tier-1 API per ADR 0003; any inconsistency that ships at v1.0 lives forever (until v2.0). This ADR both documents what’s already true AND closes the small remaining gaps flagged by the v0.49.0 audit.
The audit also surfaced a v1.0-critical RNG-parameter inconsistency
(seed in ~15 functions, random_state in stacking.py) that requires
adoption of Scientific Python SPEC 7.
That adoption ships in v0.50.0 as a focused release; this ADR documents
the convention now so it locks the rule going forward.
Decisions#
D1 — Module naming#
Modules in src/eval_toolkit/*.py are flat per
ADR 0001. Within that constraint:
Plural noun for collection-of-types modules (modules holding multiple related types of the same kind): metrics, loaders, protocols, losses, probes, splits, paths, seeds, thresholds, artifacts, claims, embeddings, scorecards (new at v0.49.0; was _scorecard).
Singular noun for domain-concept modules (modules holding a single coherent concept): harness, bootstrap, manifest, calibration, leakage, analysis, provenance, evidence, stacking, text_dedup.
Gerund for process-domain modules (modules describing an operation): preprocessing.
Private modules carry a leading underscore: _parallel.py, _deprecated.py, _version.py, _rng.py, _sweep.py. These are not in _EXPORTS; their public symbols (if any) are accessed only via top-level from eval_toolkit import X.
The asymmetric-promotion sub-rule from ADR 0001 controls when a private module should be promoted to public: collection-of-types MAY promote; single-function-only SHOULD stay underscore.
D2 — Class naming#
Classes follow PascalCase per PEP 8. Within that, eval-toolkit uses
domain-specific suffixes that map to Protocol contracts:
| Suffix | Domain | Examples | Protocol |
|---|---|---|---|
|  | Threshold selection |  |  |
|  | Cross-validation splits |  |  |
|  | Leakage detection |  |  |
|  | Dataset loading |  |  |
|  | Prediction artifact reading |  |  |
|  | Preprocessing variant |  | (functional API) |
|  | Dedup similarity backend |  |  |
|  | Adversarial char-injection attack |  |  |
|  | Adversarial char-substitution attack |  |  |
Result/output dataclasses use the suffix *Result, *CI,
*Estimate, or *Report where the type genuinely is a result object;
exception cases (Scorecard, WilsonInterval) are intentional —
Scorecard is the named domain concept, not a generic result, and
WilsonInterval is the math-term name. Document exceptions in the
class docstring.
Config/metadata dataclasses use *Spec, *Metadata, or
*Manifest per the domain.
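The suffix rule and its documented exceptions can be sketched as follows. This is a minimal illustration, not eval-toolkit code: MeanCI and every field shown are hypothetical, and the WilsonInterval body here only demonstrates where the exception note would live.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MeanCI:
    """Confidence interval for a mean (hypothetical; uses the *CI suffix)."""

    estimate: float
    lower: float
    upper: float


@dataclass(frozen=True)
class WilsonInterval:
    """Wilson score interval for a binomial proportion.

    Naming exception: this is a result object, but it keeps the
    math-term name instead of a *CI suffix. Fields are illustrative.
    """

    lower: float
    upper: float


ci = MeanCI(estimate=0.5, lower=0.4, upper=0.6)
```

Stating the exception in the class docstring keeps the rationale next to the name a reader is questioning.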
D3 — Function naming#
Functions use snake_case per PEP 8. Prefer verb-prefix for
factories when the name is action-shaped:
| Prefix | Use | Examples |
|---|---|---|
|  | Construct an object |  |
|  | Fit a model/calibrator |  |
|  | Run an evaluation harness |  |
|  | Render a figure |  |
|  | Serialize to disk |  |
|  | Check well-formedness |  |
|  | Resample-based CI |  |
|  | Paired-bootstrap difference |  |
|  | I/O read |  |
|  | Pure derivation |  |
Noun-form is OK for scalar metric functions per
ADR 0002: pr_auc,
roc_auc, brier_score, recall_at_fpr, wilson_interval. These
are sklearn-aligned and would be awkward as compute_pr_auc(...).
D4 — Parameter naming (the canonical list)#
Locked at v1.0 — these names mean these things, everywhere:
| Parameter | Meaning |
|---|---|
|  | Ground-truth labels (binary, shape |
|  | Continuous score / probability (shape |
|  | Discrete prediction (when threshold-dependent) |
|  | Bootstrap iteration count |
|  | Two-sided confidence level (0.95 default) |
|  | Binning count for calibration / ECE |
|  | Parallelism (joblib convention; sklearn-aligned) |
|  | Matplotlib axis (matplotlib convention) |
|  | Callable |
|  | RNG argument per SPEC 7, the canonical convention (adopted v0.50.0). Accepts |
|  | Legacy name used through v0.49; replaced by |
Future functions MUST use these names. A PR that introduces
labels= (instead of y_true=), scores= (instead of y_score=),
alpha= (instead of confidence=), or any deviation must justify it
in the PR description or rename to the canonical name.
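A signature that follows the canonical names might look like the sketch below. It assumes NumPy is available. The names y_true, y_score, metric, confidence, and rng come from this ADR; bootstrap_ci_sketch, mean_squared_gap, and the n_resamples parameter name (modeled on DEFAULT_N_RESAMPLES) are assumptions made for illustration, not the library's API.

```python
from collections.abc import Callable

import numpy as np


def bootstrap_ci_sketch(
    y_true,
    y_score,
    metric: Callable[[np.ndarray, np.ndarray], float],
    *,
    n_resamples: int = 1000,  # assumed name for the iteration count
    confidence: float = 0.95,
    rng=None,
) -> tuple[float, float]:
    """Percentile-bootstrap interval for metric(y_true, y_score)."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    # SPEC 7 pattern: normalize None, an int seed, or a Generator.
    rng = np.random.default_rng(rng)
    n = y_true.shape[0]
    stats = np.empty(n_resamples)
    for i in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        stats[i] = metric(y_true[idx], y_score[idx])
    tail = (1.0 - confidence) / 2.0
    lower, upper = np.quantile(stats, [tail, 1.0 - tail])
    return float(lower), float(upper)


def mean_squared_gap(y_true, y_score):
    """Illustrative (y_true, y_score) -> float callable."""
    return float(np.mean((y_score - y_true) ** 2))


y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])
y_score = np.array([0.1, 0.8, 0.7, 0.3, 0.9, 0.2, 0.6, 0.4])
interval = bootstrap_ci_sketch(
    y_true, y_score, mean_squared_gap, n_resamples=200, rng=0
)
```

Because rng is normalized once at the top, callers can pass an integer seed for reproducibility or share one Generator across several calls.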
Known deviations (recorded 2026-06-12, v1.12.0):
cluster_bootstrap_ci(statistic=) (shipped v1.7.0) deviates from the
canonical metric for the (y_true, y_score) -> float callable role
— 7 siblings in bootstrap.py use metric. The signature is
Tier-1-frozen per ADR 0003
(snapshot-pinned; a deprecation alias would itself mutate the Tier-1
signature), so the rename is queued for v2.0 in the
v2.0 breaking-changes queue (#116),
which also carries the related per_stratum_metric naming
discussion. The eda random_state/n_bootstrap deviations flagged
by the same audit were closed at v1.12.0 via the DEPRECATION.md
one-time-exception path (#100).
D5 — Constants#
UPPER_SNAKE_CASE per PEP 8. Tier-1 constants in _EXPORTS include
DEFAULT_SEED, DEFAULT_N_RESAMPLES, DEFAULT_CONFIDENCE,
DEFAULT_METHOD, MANIFEST_SCHEMA_VERSION, CORE_TECHNIQUES,
ADVANCED_TECHNIQUES, ALL_TECHNIQUES.
D6 — Protocol naming#
Protocols follow PascalCase per PEP 8 and are named semantically
per ADR 0003 §1.
There is no forced uniform suffix: the contract is the method
shape, not the name shape. The 10 strict Tier-2 Protocols plus 1 opt-in
are named per their role: Scorer, LeakageCheck, Splitter,
ThresholdSelector, DatasetLoader, MetricSpec, TextTransform,
MetaLearner, Probe, SimilarityStrategy, Versioned. (Note:
SimilarityStrategy was promoted from “pre-v0.7 internal” to strict
Tier-2 at v1.0.2 per RC2 reconciliation; #76.)
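The "method shape, not name shape" idea can be sketched with typing.Protocol. The score method signature below is an assumption made for illustration; the actual Scorer contract in eval-toolkit may differ, and LengthHeuristic and mean_score are hypothetical.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Scorer(Protocol):
    """Hypothetical method shape, for illustration only."""

    def score(self, texts: list[str]) -> list[float]: ...


class LengthHeuristic:
    """Conforms structurally: no inheritance and no shared suffix."""

    def score(self, texts: list[str]) -> list[float]:
        return [min(len(t) / 100.0, 1.0) for t in texts]


def mean_score(scorer: Scorer, texts: list[str]) -> float:
    """Accept anything with a matching score() method."""
    values = scorer.score(texts)
    return sum(values) / len(values)


heuristic = LengthHeuristic()
average = mean_score(heuristic, ["a" * 50, "b" * 150])
```

A class named LengthHeuristic satisfies the protocol purely because its method matches, which is why a uniform suffix would add no information.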
D7 — TypeVars#
Internal TypeVars use a leading underscore per Google Python Style
Guide §3.19.10: _T = TypeVar("_T"). Public, constrained TypeVars
without underscore are allowed when explicitly part of an exported API.
D8 — Fitted estimator attributes (sklearn alignment)#
Estimator-style classes (fit/predict pattern) that store
learned-from-data attributes use trailing underscore per
scikit-learn convention: coef_, classes_, n_features_in_,
feature_importances_. These attributes MUST NOT be set in
__init__ — set them only in fit(). (See
scikit-learn Developing estimators.)
Frozen reference-impl dataclasses (@dataclass(frozen=True, slots=True))
are exempt — they hold config, not fitted state.
stacking.LogisticStacker is the current canonical example.
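The fit-only rule can be illustrated with a toy estimator. PriorRateClassifier and positive_rate_ are hypothetical names, and this is a sketch of the convention only, not how stacking.LogisticStacker is implemented.

```python
class PriorRateClassifier:
    """Toy estimator that predicts one label from the training prior."""

    def __init__(self, threshold: float = 0.5) -> None:
        # Hyperparameters only; no trailing-underscore attributes here.
        self.threshold = threshold

    def fit(self, X, y):
        # Learned-from-data state is created in fit() and nowhere else.
        self.classes_ = sorted(set(y))
        self.n_features_in_ = len(X[0])
        self.positive_rate_ = sum(1 for label in y if label == 1) / len(y)
        return self

    def predict(self, X):
        if not hasattr(self, "classes_"):
            raise RuntimeError("call fit() before predict()")
        label = 1 if self.positive_rate_ >= self.threshold else 0
        return [label for _ in X]


estimator = PriorRateClassifier()
was_fitted_before = hasattr(estimator, "classes_")
estimator.fit([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]], [0, 1, 1])
```

Keeping fitted attributes out of __init__ means their presence doubles as an "is fitted" check, as predict() does here.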
D9 — Docstring style#
NumPy docstring format per STYLE.md §12. Prose wraps at 75 cols
(numpydoc convention); doctest code blocks follow the 100-col Black
rule. The 75-col rule keeps docstrings readable in a terminal help()
call; the 100-col rule keeps code in docstrings readable in a normal
editor.
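A docstring laid out this way might look like the sketch below. brier_score is named in D3, but the pure-Python body and the exact signature here are assumptions made so the example stays self-contained.

```python
def brier_score(y_true, y_score):
    """Mean squared difference between labels and probabilities.

    Prose in this docstring wraps at 75 columns so that ``help()`` output
    stays readable in a standard terminal.

    Parameters
    ----------
    y_true : sequence of int
        Ground-truth binary labels (0 or 1).
    y_score : sequence of float
        Predicted probability of the positive class, same length as
        ``y_true``.

    Returns
    -------
    float
        The Brier score; lower is better, and 0.0 is a perfect forecast.

    Examples
    --------
    >>> brier_score([0, 1], [0.0, 1.0])
    0.0
    """
    if len(y_true) != len(y_score):
        raise ValueError("y_true and y_score must have the same length")
    return sum((s - t) ** 2 for t, s in zip(y_true, y_score)) / len(y_true)
```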
D10 — Test naming#
tests/test_<module>.py mirrors src/eval_toolkit/<module>.py. Test
functions are test_<thing_under_test>_<scenario>. No class-based
test grouping unless fixtures truly demand it (rare in this codebase).
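As a sketch of the naming pattern, a hypothetical tests/test_thresholds.py could contain flat test functions like these. apply_threshold is a stand-in defined inline so the example runs on its own; it is not an eval-toolkit function.

```python
# Hypothetical tests/test_thresholds.py, mirroring
# src/eval_toolkit/thresholds.py.


def apply_threshold(y_score, threshold):
    """Stand-in for the function under test (illustrative only)."""
    return [int(score >= threshold) for score in y_score]


# Pattern: test_<thing_under_test>_<scenario>, with no grouping class.
def test_apply_threshold_score_equal_to_threshold_is_positive():
    assert apply_threshold([0.5], threshold=0.5) == [1]


def test_apply_threshold_empty_input_returns_empty_list():
    assert apply_threshold([], threshold=0.5) == []
```

With this layout, a failing test name alone identifies both the function and the scenario that broke.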
Industry alignment#
The conventions above were verified against canonical sources during the v0.49.0 audit:
PEP 8 (Style Guide for Python Code): module/class/function/variable/constant naming, leading-underscore privacy.
scikit-learn Developing estimators: fit/predict contract, trailing-underscore for fitted attributes, parameter names y_true/y_pred/y_score/n_jobs.
numpydoc Format Spec: docstring sections + 75-col prose rule.
Google Python Style Guide: TypeVar leading-underscore (§3.19.10) for internal types.
Scientific Python SPEC 7 (Seeding pseudo-random number generation): rng: RNGLike | SeedLike | None parameter convention (adopted in v0.50.0; documented here so the rule locks now).
Eval-toolkit deviates from industry conventions in four places, all documented; the first three are intentional and the fourth is held in place by the Tier-1 freeze:
Unicode math identifiers (π, θ, μ, σ, α, β) are permitted in math kernels with a required English-comment alias, per STYLE.md §3 and §16. PEP 8 requires ASCII-only identifiers in the standard library; eval-toolkit’s math-paper-fidelity domain justifies the exception.
set_global_seeds(seed: int) keeps seed even after the v0.50.0 SPEC 7 adoption: this is a global-state setter, not a per-function RNG argument, so SPEC 7 doesn’t apply.
Adversarial dataclass fields keep seed: int = 42 because they use Python’s stdlib random.Random(seed), not NumPy’s Generator. SPEC 7’s typing (RNGLike = np.random.Generator | ...) is strictly NumPy-scoped.
cluster_bootstrap_ci(statistic=) keeps the scipy-flavored statistic name (scipy.stats.bootstrap uses it) despite §D4 reserving metric for this callable role, not by intent but by Tier-1 freeze; see the Known-deviations note under D4 and the v2.0 queue (#116).
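The stdlib-seeded case can be sketched as below. CharInjectionSketch, its fields other than seed, and its apply method are hypothetical and do not reproduce any eval-toolkit attack class; the point is only that a random.Random(seed) instance takes an integer seed, not a NumPy Generator.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class CharInjectionSketch:
    """Hypothetical attack config that injects a marker character."""

    char: str = "*"
    rate: float = 0.2
    seed: int = 42  # stays seed: it feeds random.Random, not NumPy

    def apply(self, text: str) -> str:
        # A fresh local generator per call makes the output depend only
        # on (text, config), with no global random state involved.
        rng = random.Random(self.seed)
        out = []
        for ch in text:
            out.append(ch)
            if rng.random() < self.rate:
                out.append(self.char)
        return "".join(out)


attack = CharInjectionSketch()
perturbed = attack.apply("hello world")
```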
Forward enforcement#
Every PR that adds a new public symbol MUST satisfy these conventions or document the exception in the PR description. There is no automated lint enforcement yet — that is a v1.x candidate (deferred N8 in the v0.49.0 plan; tracked as a future-improvement issue once the patterns have settled across a couple of v1.x minors).
References#
ADR 0001: Flat module layout (cross-references the asymmetric-promotion principle).
ADR 0003: Stability contract + Gate 3 methodology (defines the Tier-1/2/3 framework these naming rules operate inside).
STYLE.md (root): contributor-facing daily reference; this ADR is the decision record.
~/.claude/plans/what-git-issues-are-bright-torvalds.md: the v0.49.0 plan that produced this ADR.