Test marker strategy — ratify 4-marker stratification (unit / smoke / integration / network)

Published

May 16, 2026

ADR-029: Test marker strategy — ratify 4-marker stratification

Status

Accepted (2026-05-16). Closes the fourth and final [OPEN] row in Phase 0-06 (§5 Code architecture + §STYLE — rows 348-351 of SPEC_GREENFIELD ledger). Companion to ADR-026 (module layout), ADR-027 (smoke vs canonical), and ADR-028 (coverage floor).

Context

pytest markers (@pytest.mark.unit, @pytest.mark.smoke, etc.) let the suite be sliced by intent — pytest -m unit runs only fast deterministic tests; pytest -m integration runs GPU/network-dependent ones. Markers must be registered (in conftest.py or pyproject.toml) for --strict-markers to allow them; unregistered markers fail loudly.

The current state of the marker taxonomy is provisional but consistent across three artefacts:

pyproject.toml [tool.pytest.ini_options] registers unit, smoke, integration, network (4 markers) with addopts = "-v --tb=short --strict-markers".
tests/conftest.py mirrors the same 4 markers via pytest_configure + addinivalue_line calls.
STYLE.md “project deltas” section documents the 4-marker stratification.
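The conftest mirror described above might look roughly like the following. This is a minimal sketch, not the repo's actual file: the MARKERS dict name is hypothetical, the descriptions are paraphrased from this ADR, and the pyproject.toml fragment in the comment is an assumed counterpart. The only pytest call used is config.addinivalue_line, which appends a line to an ini option.

```python
# tests/conftest.py (sketch; MARKERS and the descriptions are assumptions)
#
# Assumed pyproject.toml counterpart:
#   [tool.pytest.ini_options]
#   addopts = "-v --tb=short --strict-markers"
#   markers = [
#       "unit: fast, deterministic, no IO",
#       "smoke: end-to-end fixture-data pass",
#       "integration: exercises real external deps",
#       "network: strictly requires network access",
#   ]

MARKERS = {
    "unit": "fast, deterministic, no IO",
    "smoke": "end-to-end fixture-data pass",
    "integration": "exercises real external deps",
    "network": "strictly requires network access",
}


def pytest_configure(config):
    """Register each marker so --strict-markers accepts it."""
    for name, description in MARKERS.items():
        # pytest expects marker lines in "name: description" form
        config.addinivalue_line("markers", f"{name}: {description}")
```

Keeping the names in one dict means the conftest side has a single place to compare against the pyproject list.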

The Q2 reframing (per ADR-027) — “math rigor lives upstream in eval-toolkit; here is debugging-grade” — pre-decides that golden and property markers should NOT be added here. They belong upstream where the math implementations and their golden contracts live.

Five options were considered:

(A) Ratify existing 4 markers: zero churn; honors ADR-027 framing.
(B) 4 + slow (5 markers): adds slow for tests >30s. Redundant with smoke (smoke is already “~5min, end-to-end”).
(C) 3 markers (drop network): simplifies by removing the currently unused marker. Wastes an ADR cycle when the first network-dependent test lands at Phase 1.
(D) 4 + gpu (5 markers): separates CUDA from generic integration. Marker proliferation; the pytest.importorskip + skipif idiom handles it without a marker.
(E) 5 markers with both gpu and network: hybrid of A and D. Same proliferation cost as D.

Decision

Locked marker taxonomy — exactly 4 markers

Marker	Registered location	Semantics	Wall-clock target	Allowed external deps
`unit`	pyproject + conftest	fast, deterministic, no IO	< 1 sec/test	none
`smoke`	pyproject + conftest	end-to-end fixture-data pass	< 10 min total	none (no GPU, no network)
`integration`	pyproject + conftest	exercises real external deps; may skip via importorskip/skipif	~5-10 min	GPU, HF Hub, RunPod (per pre-flight per ADR-027)
`network`	pyproject + conftest	strictly requires network access	varies	network only (HF Hub fetch, runpod-deploy GraphQL)
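To illustrate how the four strata in the table attach to tests, here is a hedged sketch. The test names and bodies are hypothetical placeholders chosen for illustration, not tests from this repo.

```python
import pytest


@pytest.mark.unit
def test_label_mapping_is_stable() -> None:
    # unit: pure in-memory logic, no IO
    assert sorted({"b": 1, "a": 0}) == ["a", "b"]


@pytest.mark.smoke
def test_pipeline_runs_on_fixture_data() -> None:
    # smoke: end-to-end pass over small fixture data; no GPU, no network
    ...


@pytest.mark.integration
def test_model_loads_with_real_dependency() -> None:
    # integration: would exercise a real external dependency, skipping if absent
    ...


@pytest.mark.network
def test_fetch_pinned_dataset_revision() -> None:
    # network: would require network access in a real test
    ...
```

With markers applied this way, pytest -m unit selects only the first test, and a boolean expression such as pytest -m "not network" excludes the last one.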

`--strict-markers` enabled

Already enabled via addopts = "-v --tb=short --strict-markers" in pyproject.toml [tool.pytest.ini_options]. Unknown markers fail loudly, which prevents a typo like @pytest.mark.itegration from being accepted as a new marker with nothing more than a warning.

Two-source registration (pyproject + conftest)

Both pyproject.toml AND tests/conftest.py register the markers. Reasons:

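pyproject is canonical (read by pytest at config time; visible to any tool that parses pyproject.toml, where [tool.pytest.ini_options] sits under the [tool] table defined by PEP 518 rather than the PEP 621 [project] metadata).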
conftest mirrors for IDE discoverability (PyCharm + VS Code pytest integrations introspect pytest_configure calls; descriptions surface in the IDE marker dropdown).

The two MUST stay in sync; the invariant test test_pytest_markers_registered_and_in_sync enforces this.

Markers explicitly NOT added (with rationale)

property: Hypothesis property-based tests belong upstream in eval-toolkit (where math kernels live and where the strategy library eval_toolkit already uses Hypothesis with hypothesis.extra.numpy). Adding property here would either duplicate upstream tests or pretend project-specific math exists when it does not.
golden: golden-output snapshot tests (where the output IS the contract) belong upstream — eval-toolkit uses golden for docs.py output. This repo has no docs-output contract surface; results.json is structural-but-not-byte-exact (per-row predictions vary by RNG seed).
slow: redundant with smoke for end-to-end tests; if a unit test crosses 30s, that’s a code-smell the marker should not paper over (the test should be re-classified as smoke or refactored).
gpu: standard pattern below handles GPU-conditional skipping cleanly without taxonomy proliferation.

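import pytest

@pytest.mark.integration
def test_modernbert_load_on_gpu() -> None:
    # importorskip returns the module, so torch is never imported at module
    # scope; a top-level "import torch" would fail collection when torch is absent
    torch = pytest.importorskip("torch")
    if not torch.cuda.is_available():
        pytest.skip("GPU required")
    # ... actual test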

If Phase 1 reveals chronic friction with this pattern (e.g., 5+ tests need it), reopen via ADR to add a gpu sub-marker.

Marker-add or marker-remove protocol

Adding or removing a marker post-lock requires a superseding ADR with rationale (e.g., “Phase 1 produced 8 GPU-conditional tests; the importorskip+skipif boilerplate is paying real cost; adding gpu sub-marker”). Quietly editing pyproject.toml or conftest.py is an anti-pattern: the in-sync invariant test would catch the edit, but the deeper problem is supersession without an ADR, which the SDD discipline forbids.

Consequences

Positive

Zero churn: existing artefacts (pyproject.toml, conftest.py, STYLE.md, Makefile test-unit/test-smoke/test-integration targets) are already aligned with the locked taxonomy.
--strict-markers catches typos: any @pytest.mark.<typo> fails loudly at test-collection time.
Aligns with ADR-027 framing: math-correctness rigor (property, golden) explicitly stays upstream; this layer is debugging-grade.
Two-source registration matches IDE expectations without sacrificing canonical pyproject declaration.

Negative

The network marker is unused at lock time, which risks a dead-letter taxonomy. Mitigated by Phase 1 expectations: HF dataset SHA-pinning tests (per ADR-016) will land marked network.
No gpu sub-marker means GPU-conditional tests carry the importorskip + skipif boilerplate. Acceptable cost for prototype scope; reopen if boilerplate proliferates.
The absence of slow means a unit test that grows to 30s+ has nowhere to escape to without re-classification. Treated as a feature, not a bug.

Limitation

The 4-marker strata do not capture every cross-cutting concern (no slow, no gpu, no flaky, no property). The discipline relies on pytest.mark.skipif + pytest.importorskip to handle conditional skipping within a marker. If this pattern produces verbose boilerplate at Phase 1 (~5+ tests), reopen via superseding ADR.

Extension condition for revisit

gpu sub-marker: if conditional skipif(not torch.cuda.is_available()) boilerplate appears in 5+ tests.
slow: if any unit test legitimately crosses 30s and cannot be split or re-marked as smoke.
property: only if scope extends to writing project-specific math primitives (currently out-of-scope per ADR-027 prototype-grade framing; math kernels live upstream).
flaky: only if Phase 1 reveals genuinely-flaky tests that cannot be made deterministic; preference is to fix the flakiness, not paper over with a marker.

Alternatives considered

(B) 4 + slow — rejected; redundant with smoke; absence is a feature.
(C) 3 markers (drop network) — rejected; HF SHA-pin tests will need it at Phase 1; pre-registering avoids ADR-cycle waste.
(D) 4 + gpu — rejected; standard importorskip + skipif idiom handles it; reopen if boilerplate proliferates.
(E) 5 with gpu + network — rejected; same proliferation cost as D.