Test marker strategy — ratify 4-marker stratification (unit / smoke / integration / network)
ADR-029: Test marker strategy — ratify 4-marker stratification
Status
Accepted (2026-05-16). Closes the fourth and final [OPEN] row in Phase 0-06 (§5 Code architecture + §STYLE — rows 348-351 of SPEC_GREENFIELD ledger). Companion to ADR-026 (module layout), ADR-027 (smoke vs canonical), and ADR-028 (coverage floor).
Context
pytest markers (@pytest.mark.unit, @pytest.mark.smoke, etc.) let the suite be sliced by intent — pytest -m unit runs only fast deterministic tests; pytest -m integration runs GPU/network-dependent ones. Markers must be registered (in conftest.py or pyproject.toml) for --strict-markers to allow them; unregistered markers fail loudly.
The current state of the marker taxonomy is provisional but consistent across three artefacts:
- pyproject.toml [tool.pytest.ini_options] registers unit, smoke, integration, network (4 markers) with addopts = "-v --tb=short --strict-markers".
- tests/conftest.py mirrors the same 4 markers via pytest_configure + addinivalue_line calls.
- STYLE.md “project deltas” section documents the 4-marker stratification.
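As a rough illustration of the conftest mirror, the sketch below registers the four markers from a single mapping. The MARKERS dict and the description wording are assumptions made for this example (paraphrased from the semantics in this ADR), not a copy of the repository's tests/conftest.py.

```python
# Hypothetical tests/conftest.py sketch; descriptions are illustrative wording.
MARKERS = {
    "unit": "fast, deterministic, no IO",
    "smoke": "end-to-end fixture-data pass (no GPU, no network)",
    "integration": "exercises real external deps; may skip via importorskip/skipif",
    "network": "strictly requires network access",
}


def pytest_configure(config):
    """Register each marker so --strict-markers accepts it."""
    for name, description in MARKERS.items():
        # Same "name: description" format as the `markers` ini option.
        config.addinivalue_line("markers", f"{name}: {description}")
```

Keeping the names in one mapping means the hook cannot register a marker that the mapping does not list, which makes the mirror easier to compare against pyproject.toml.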
The Q2 reframing (per ADR-027) — “math rigor lives upstream in eval-toolkit; here is debugging-grade” — pre-decides that golden and property markers should NOT be added here. They belong upstream where the math implementations and their golden contracts live.
Five options were considered:
- (A) Ratify existing 4 markers: zero churn; honors ADR-027 framing.
- (B) 4 + slow (5 markers): adds slow for tests >30s. Redundant with smoke (smoke is already “~5min, end-to-end”).
- (C) 3 markers (drop network): simplifies by removing the currently-unused marker. Wastes an ADR cycle when the first network-dependent test lands at Phase 1.
- (D) 4 + gpu (5 markers): separates CUDA from generic integration. Marker proliferation; the pytest.importorskip + skipif idiom handles it without a marker.
- (E) 5 markers with both gpu and network: hybrid of A and D. Same proliferation cost as D.
Decision
Locked marker taxonomy — exactly 4 markers
| Marker | Registered location | Semantics | Wall-clock target | Allowed external deps |
|---|---|---|---|---|
| unit | pyproject + conftest | fast, deterministic, no IO | < 1 sec/test | none |
| smoke | pyproject + conftest | end-to-end fixture-data pass | < 10 min total | none (no GPU, no network) |
| integration | pyproject + conftest | exercises real external deps; may skip via importorskip/skipif | ~5-10 min | GPU, HF Hub, RunPod (per pre-flight per ADR-027) |
| network | pyproject + conftest | strictly requires network access | varies | network only (HF Hub fetch, runpod-deploy GraphQL) |
--strict-markers enabled
Already enabled in pyproject.toml: [tool.pytest.ini_options] sets addopts = "-v --tb=short --strict-markers". Unknown markers fail loudly at collection time, which prevents a typo like @pytest.mark.itegration from passing with only a warning and quietly dropping the test out of pytest -m integration runs.
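To make the typo failure mode concrete, the sketch below imitates the idea behind the check: compare the marker names a test uses against the registered set and report anything unknown. This is a conceptual illustration only, not pytest's implementation; unknown_markers and REGISTERED are hypothetical names, and the close-match suggestion via difflib is an extra that pytest itself does not provide.

```python
import difflib

# The four markers locked by this ADR.
REGISTERED = ("unit", "smoke", "integration", "network")


def unknown_markers(used, registered=REGISTERED):
    """Map each unregistered marker name to its closest registered match."""
    report = {}
    for name in used:
        if name not in registered:
            # get_close_matches returns [] when nothing is similar enough.
            report[name] = difflib.get_close_matches(name, registered, n=1)
    return report
```

Under these assumptions, a misspelled itegration is reported as unknown instead of being treated as a fifth marker.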
Two-source registration (pyproject + conftest)
Both pyproject.toml AND tests/conftest.py register the markers. Reasons:
- pyproject is canonical (read by pytest at config time; visible to PEP-621-aware tools).
- conftest mirrors for IDE discoverability (PyCharm + VS Code pytest integrations introspect pytest_configure calls; descriptions surface in the IDE marker dropdown).
The two MUST stay in sync; the invariant test test_pytest_markers_registered_and_in_sync enforces this.
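The body of that invariant test is not shown in this ADR. One possible shape of the comparison is sketched below, assuming both sources have already been read into lists of "name: description" strings (the format used by the markers ini option and by addinivalue_line). The helper names marker_names and sync_problems are hypothetical.

```python
def marker_names(entries):
    """Extract marker names from 'name: description' registration lines."""
    return [entry.split(":", 1)[0].strip() for entry in entries]


def sync_problems(pyproject_entries, conftest_entries, locked):
    """Return human-readable mismatches; an empty list means in sync."""
    py_names = set(marker_names(pyproject_entries))
    conf_names = set(marker_names(conftest_entries))
    problems = []
    for name in sorted(py_names - conf_names):
        problems.append(f"{name}: in pyproject only")
    for name in sorted(conf_names - py_names):
        problems.append(f"{name}: in conftest only")
    # Anything registered but not locked, or locked but not registered.
    for name in sorted((py_names | conf_names) ^ set(locked)):
        problems.append(f"{name}: deviates from locked taxonomy")
    return problems
```

A test built on this would assert that sync_problems(...) returns an empty list, so the failure message names the exact marker that drifted.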
Markers explicitly NOT added (with rationale)
- property: Hypothesis property-based tests belong upstream in eval-toolkit (where math kernels live and where the strategy library eval_toolkit already uses Hypothesis with hypothesis.extra.numpy). Adding property here would either duplicate upstream tests or pretend project-specific math exists when it does not.
- golden: golden-output snapshot tests (where the output IS the contract) belong upstream; eval-toolkit uses golden for docs.py output. This repo has no docs-output contract surface; results.json is structural-but-not-byte-exact (per-row predictions vary by RNG seed).
- slow: redundant with smoke for end-to-end tests; if a unit test crosses 30s, that’s a code-smell the marker should not paper over (the test should be re-classified as smoke or refactored).
- gpu: standard pattern below handles GPU-conditional skipping cleanly without taxonomy proliferation.
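```python
import pytest


@pytest.mark.integration
def test_modernbert_load_on_gpu() -> None:
    # importorskip returns the module, so torch is imported only when installed;
    # a top-level `import torch` would fail collection before the skip could run.
    torch = pytest.importorskip("torch")
    if not torch.cuda.is_available():
        pytest.skip("GPU required")
    # ... actual test
```

If Phase 1 reveals chronic friction with this pattern (e.g., 5+ tests need it), reopen via ADR to add a gpu sub-marker.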
Marker-add or marker-remove protocol
Adding or removing a marker post-lock requires a superseding ADR with rationale (e.g., “Phase 1 produced 8 GPU-conditional tests; the importorskip+skipif boilerplate is paying real cost; adding gpu sub-marker”). Quietly editing pyproject.toml or conftest.py is an anti-pattern: the in-sync invariant test would catch it, but supersession without an ADR is exactly what the SDD discipline forbids.
Consequences
Positive
- Zero churn: existing artefacts (pyproject.toml, conftest.py, STYLE.md, Makefile test-unit/test-smoke/test-integration targets) are already aligned with the locked taxonomy.
- --strict-markers catches typos: any @pytest.mark.<typo> fails loudly at test-collection time.
- Aligns with ADR-027 framing: math-correctness rigor (property, golden) explicitly stays upstream; this layer is debugging-grade.
- Two-source registration matches IDE expectations without sacrificing canonical pyproject declaration.
Negative
- network marker not currently used at lock time: risk of dead-letter taxonomy. Mitigated by Phase 1 expectations: HF dataset SHA-pinning tests (per ADR-016) will land marked network.
- No gpu sub-marker means GPU-conditional tests carry the importorskip+skipif boilerplate. Acceptable cost for prototype scope; reopen if boilerplate proliferates.
- slow absence means a unit test that grows to 30s+ has nowhere to escape to without re-classification. Treated as a feature, not a bug.
Limitation
The 4-marker strata do not capture every cross-cutting concern (no slow, no gpu, no flaky, no property). The discipline relies on pytest.mark.skipif + pytest.importorskip to handle conditional skipping within a marker. If this pattern produces verbose boilerplate at Phase 1 (~5+ tests), reopen via superseding ADR.
Extension condition for revisit
- gpu sub-marker: if conditional skipif(not torch.cuda.is_available()) boilerplate appears in 5+ tests.
- slow: if any unit test legitimately crosses 30s and cannot be split or re-marked as smoke.
- property: only if scope extends to writing project-specific math primitives (currently out-of-scope per ADR-027 prototype-grade framing; math kernels live upstream).
- flaky: only if Phase 1 reveals genuinely-flaky tests that cannot be made deterministic; preference is to fix the flakiness, not paper over with a marker.
Alternatives considered
- (B) 4 + slow: rejected; redundant with smoke; absence is a feature.
- (C) 3 markers (drop network): rejected; HF SHA-pin tests will need it at Phase 1; pre-registering avoids ADR-cycle waste.
- (D) 4 + gpu: rejected; standard importorskip + skipif idiom handles it; reopen if boilerplate proliferates.
- (E) 5 with gpu + network: rejected; same proliferation cost as D.