ADR-027: Smoke vs canonical separation — three Makefile targets stratified by execution context
Status
Accepted (2026-05-16). Closes the second of 4 [OPEN] rows in Phase 0-06 (§5 Code architecture + §STYLE — rows 348-351 of SPEC_GREENFIELD ledger). Companion to ADR-026 (module layout), ADR-028 (coverage floor), and ADR-029 (test marker strategy).
Context
SPEC_GREENFIELD §5 says smoke must run in less than 10 minutes on a laptop without GPU; canonical must reproduce headline numbers from a published config. The boundary determines what a reviewer can verify locally vs needs cloud access for. SPEC §6 says “a stranger can clone, install, and reproduce headline numbers via documented commands.” Three options were considered for the surface that separates the two:
- (A) Single profile-switched target: make eval PROFILE={fixtures|full}. One CLI surface; profile flag selects config. Hides the cost asymmetry (a PROFILE=full typo on a laptop = OOM crash + potential billed call); a reviewer reading the Makefile must grok the profile system before knowing which invocation is cheap.
- (B) Two distinct targets (make smoke, make headline-cloud): spec default. Visual cost asymmetry; matches existing partial Makefile state; matches eval-toolkit Makefile precedent.
- (C) Three tiers (laptop smoke / CPU full / GPU canonical): gives the reviewer a middle option for reference-rung-only reproduction. Three configs to maintain; full-cpu is a half-measure that doesn't reproduce headline numbers.
User feedback at decision time reframed the question:
“This is a prototype, not a production deployment. The golden-evals and thorough testing should be done in eval-toolkit where the math work resides. Smoke tests here and lightweight integration tests using the local GPU are useful for debugging as well as on the cloud GPU for smoke tests, but we are not claiming to have a rigorous deployment.”
This reframing decouples three things that the original three options conflated:
- Math-rigor production-grade testing (Hypothesis property tests, golden-output snapshots, ≥90% coverage on math kernels) belongs upstream in eval-toolkit where the math implementations live. Re-doing it here would duplicate work AND mislead reviewers into thinking this repo’s test layer validates math correctness.
- Debugging-grade local testing (smoke + lightweight integration) belongs here. Sufficient to catch breakage before paying for cloud time.
- Canonical evaluation orchestration (the actual headline-cloud run) is a separate concept from testing. It is the deliverable, not a verification of the deliverable. Its Makefile target wraps runpod-deploy primitives but is not part of the test taxonomy.
The reframing produces a stratified-by-execution-context variant of (B): three targets but only two are tests.
Decision
Three Makefile targets, stratified by execution context
| Target | Execution context | Compute | Network | Wall-clock budget | Purpose |
|---|---|---|---|---|---|
| make smoke | laptop only | no GPU | no network | less than 10 min | dev debugging + reviewer “does this wire together” check |
| make test-integration | local GPU OR cloud pod | GPU when available; skip gracefully when not | optional | less than ~10 min | dev debugging on workstation GPU; pre-flight smoke on cloud pod before headline-cloud |
| make headline-cloud | RunPod (billed) | H100/equivalent per ADR-020 gpu_order failover | required | hours; cost-cap-gated at $125/job per ADR-020 + A-002 | canonical evaluation deliverable, not a test |
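The same stratification can be written down as data, which makes the "three targets, only two are tests" split easy to check mechanically. The following is a minimal sketch only: the Target record, its field names, and the TARGETS tuple are hypothetical illustrations and do not exist in the repo; the values are transcribed from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """Hypothetical record mirroring one row of the decision table."""

    name: str
    context: str
    is_test: bool
    approval_gated: bool


# Values come from the table; the structure itself is an assumption.
TARGETS = (
    Target("smoke", "laptop only", is_test=True, approval_gated=False),
    Target("test-integration", "local GPU or cloud pod", is_test=True, approval_gated=False),
    Target("headline-cloud", "RunPod (billed)", is_test=False, approval_gated=True),
)


def test_targets() -> list:
    """Names of the targets that belong to the test taxonomy."""
    return [t.name for t in TARGETS if t.is_test]


if __name__ == "__main__":
    for t in TARGETS:
        print(f"make {t.name}: test={t.is_test}, approval_gated={t.approval_gated}")
```

Keeping is_test and approval_gated as separate fields reflects the point made in the Context section: the canonical run is a deliverable with a cost gate, not a third test tier.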
Target details
make smoke (already partially implemented — test-smoke target exists)
test-smoke:
	uv run pytest -m smoke -q
To be augmented at Phase 1 with a fixture-data end-to-end pass:
smoke: test-smoke
	uv run python scripts/run_metrics_battery.py \
		--config configs/profiles/fixtures.yaml \
		--output evals/smoke/results.json
Constraints:
- No GPU dependencies (must execute on a laptop with torch available but no CUDA device).
- No network calls (fixture data lives in tests/fixtures/; HF dataset SHAs not fetched in smoke; LLM-judge calls mocked).
- Total wall-clock ≤ 10 min on a laptop (target ~5 min).
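One way to turn the no-network constraint from a convention into something a test run can enforce is to make outbound connections fail loudly. The sketch below is an assumption about how that could be done, not something the repo is known to contain: block_network and NetworkBlockedError are hypothetical names, and the guard only covers socket.connect (it does not intercept DNS lookups or connect_ex).

```python
import socket
from contextlib import contextmanager
from typing import Iterator


class NetworkBlockedError(RuntimeError):
    """Raised when code inside the guard tries to open a connection."""


@contextmanager
def block_network() -> Iterator[None]:
    """Temporarily make every socket.connect() call raise."""
    original_connect = socket.socket.connect

    def guarded_connect(self: socket.socket, address: object) -> None:
        raise NetworkBlockedError(f"network access attempted: {address!r}")

    socket.socket.connect = guarded_connect  # type: ignore[method-assign]
    try:
        yield
    finally:
        # Always restore the real method, even if the body raised.
        socket.socket.connect = original_connect  # type: ignore[method-assign]


if __name__ == "__main__":
    with block_network():
        sock = socket.socket()
        try:
            sock.connect(("127.0.0.1", 9))
        except NetworkBlockedError as exc:
            print(f"blocked as intended: {exc}")
        finally:
            sock.close()
```

In a pytest setup this could plausibly be wrapped in an autouse fixture that applies only to tests carrying the smoke marker, so an accidental HF download or un-mocked LLM-judge call fails immediately instead of silently passing on a networked machine.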
make test-integration (already partially implemented — test-integration target exists)
test-integration:
uv run pytest -m integration -q
GPU-awareness pattern (the integration tests themselves):
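import pytest

@pytest.mark.integration
def test_modernbert_load_on_gpu() -> None:
    # Import torch lazily so a missing install skips the test instead of erroring at collection.
    torch = pytest.importorskip("torch")
    # Skip gracefully on machines without a CUDA device.
    if not torch.cuda.is_available():
        pytest.skip("GPU required")
    # ... actual test
Dual execution contexts: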
- Locally on developer workstation GPU: make test-integration runs the GPU tests if a CUDA device is available; skips them otherwise. Useful for debugging trained-rung paths without paying for cloud time.
- On cloud pod as pre-flight smoke: the same make test-integration invocation runs as part of the cloud pod startup sequence, before make headline-cloud proceeds. Validates that the cloud env (CUDA driver, flash-attn fallback recipe per ADR-020, HF cache, secrets injection) works end-to-end on the rented hardware before billing the canonical run.
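Because the same skip logic applies in both execution contexts, it may be convenient to factor the check into one helper rather than repeating it in every test. The following is a sketch under that assumption; gpu_skip_reason is a hypothetical name, and the helper uses only the standard library so it can be imported even where torch is absent.

```python
import importlib
import importlib.util
from typing import Optional


def gpu_skip_reason() -> Optional[str]:
    """Return why GPU tests should be skipped, or None when a CUDA device is usable."""
    # find_spec avoids importing torch just to learn that it is missing.
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed"
    torch = importlib.import_module("torch")
    if not torch.cuda.is_available():
        return "no CUDA device available"
    return None


if __name__ == "__main__":
    reason = gpu_skip_reason()
    if reason is None:
        print("GPU tests would run")
    else:
        print(f"GPU tests would skip: {reason}")
```

An integration test could call the helper first and pass a non-None result to pytest.skip, which keeps the skip message identical on a CPU-only laptop, a workstation, and a cloud pod.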
make headline-cloud (placeholder target until Phase 1 lands configs/runpod/headline.yaml)
# POSIX sh has no "read -p", so the prompt is printed with printf before reading.
headline-cloud:
	runpod-deploy validate --all
	runpod-deploy run --dry-run --config configs/runpod/headline.yaml
	@printf 'Approve canonical run? [y/N] '; read ans; [ "$$ans" = "y" ] || exit 1
	runpod-deploy run --config configs/runpod/headline.yaml
Cost-cap discipline (per ADR-020 + A-002):
- validate --all enforces preflight schema + DC reachability + GPU stock check before any billing.
- --dry-run produces a cost preview without provisioning (hits runpodctl + GraphQL pricing).
- Interactive approval gate prevents accidental invocation.
- pod.gpu_order 8-class failover ladder + budget.cost_cap_usd=125 per configs/runpod/headline.yaml per ADR-020.
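The arithmetic behind a cost cap is simple enough to illustrate. The sketch below is not runpod-deploy's implementation, whose internals are not described here; the function names are hypothetical and the hourly rates in the example are made-up illustrative numbers, not RunPod prices. Only the 125 USD cap comes from the text above.

```python
COST_CAP_USD = 125.0  # per-job cap stated in this ADR


def projected_cost_usd(hourly_rate_usd: float, estimated_hours: float) -> float:
    """Projected spend for one job: hourly rate times expected duration."""
    if hourly_rate_usd < 0 or estimated_hours < 0:
        raise ValueError("rate and hours must be non-negative")
    return hourly_rate_usd * estimated_hours


def within_cost_cap(
    hourly_rate_usd: float,
    estimated_hours: float,
    cap_usd: float = COST_CAP_USD,
) -> bool:
    """True when the projected spend does not exceed the cap."""
    return projected_cost_usd(hourly_rate_usd, estimated_hours) <= cap_usd


if __name__ == "__main__":
    # Illustrative inputs only: 2.50 USD/h is an assumed rate.
    for hours in (40, 50, 60):
        cost = projected_cost_usd(2.50, hours)
        print(f"{hours} h at 2.50 USD/h = {cost:.2f} USD, within cap: {within_cost_cap(2.50, hours)}")
```

With the assumed rate, a 40-hour job projects to 100 USD and passes, while a 60-hour job projects to 150 USD and would be refused before provisioning, which is the behaviour the dry-run preview is meant to surface.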
make headline-dry-run (placeholder target until Phase 1)
Standalone cost preview without the canonical run:
headline-dry-run:
	runpod-deploy validate --all
	runpod-deploy run --dry-run --config configs/runpod/headline.yaml
Useful pre-flight when revising the canonical config.
Honest framing — debugging-grade here, rigorous upstream
WRITEUP/methodology.md (Phase 5 deliverable) is required to carry a paragraph documenting the testing-rigor split:
“Math-correctness validation lives upstream in eval-toolkit (≥90% coverage floor, Hypothesis property tests, golden-output snapshots, doctests on math kernels). The local test layer in this prototype repo is debugging-grade — sufficient to catch glue-layer breakage and validate orchestration end-to-end before paying for cloud compute, not sufficient to substitute for upstream library validation. Reviewers should consult eval-toolkit’s test suite for math-correctness evidence; this repo’s test suite covers project-specific glue (data loaders, dedup calibration, reference-scorer adapters, threshold-fitting orchestrators).”
This honesty matters because the alternative (re-running upstream math tests here, or claiming our debugging-grade tests validate methodology) would either duplicate work or mislead reviewers about what was verified.
Out-of-scope explicitly
- Production-grade test rigor in this repo (Hypothesis property tests, golden-output snapshots) — belongs upstream; if scope extends to production deployment, reopen via superseding ADR.
- Re-implementation of upstream library tests in this repo — anti-pattern (duplicates work, drifts from upstream, wastes review cycles).
- A “full-CPU” middle tier (Option C) — YAGNI for prototype; reference-rung-only ad-hoc reproduction is a curious-reviewer manual invocation, not a Makefile contract surface.
Consequences
Positive
- Visual cost asymmetry preserved: make smoke and make test-integration are obviously safe; make headline-cloud is obviously billed and gated by interactive approval.
- Dual-execution-context integration tests get double duty: same code, same Makefile target, run locally for dev iteration AND on the cloud pod for pre-flight smoke. No per-context boilerplate.
- Honest framing kills the “did your tests prove methodology correctness” reviewer question: the WRITEUP/methodology.md paragraph defuses it upfront.
- Existing Makefile targets ratified: test-smoke and test-integration are unchanged; only headline-cloud + headline-dry-run need adding (as placeholders until Phase 1 produces configs/runpod/headline.yaml).
Negative
- make smoke cannot validate trained-rung numbers: model weights are too large for a laptop and training requires a GPU. Smoke validates glue and orchestration only.
- make test-integration requires torch at install time even on CPU-only laptops (for the pytest.importorskip to succeed; tests skip after that). Acceptable cost: torch is a project dependency anyway.
- Cost-cap pre-flight depends on runpod-deploy being available locally before invocation; CI cannot run make headline-cloud (by design: canonical runs are operator-initiated, not CI-triggered).
Limitation
The 10-minute smoke budget and ~10-minute integration budget are empirical caps. If Phase 1 reveals smoke creeps to 15 min or integration to 30 min, reopen via ADR with the actual data. The caps are heuristics for reviewer-friendly fast-iteration loops, not load-bearing methodological commitments.
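Collecting "the actual data" for such a revisit only needs a wall-clock measurement around each target. A minimal sketch, assuming a hypothetical run_with_budget helper (not part of the repo) and using a trivial stand-in command where the real invocation would be make smoke or make test-integration:

```python
import subprocess
import sys
import time
from typing import Sequence, Tuple

SMOKE_BUDGET_S = 10 * 60  # the 10-minute cap stated above


def run_with_budget(cmd: Sequence[str], budget_s: float) -> Tuple[int, float, bool]:
    """Run cmd to completion; return (exit code, elapsed seconds, within budget)."""
    start = time.monotonic()
    completed = subprocess.run(list(cmd), check=False)
    elapsed = time.monotonic() - start
    return completed.returncode, elapsed, elapsed <= budget_s


if __name__ == "__main__":
    # Stand-in command; in the repo this would be ["make", "smoke"].
    code, elapsed, ok = run_with_budget([sys.executable, "-c", "pass"], SMOKE_BUDGET_S)
    print(f"exit={code} elapsed={elapsed:.1f}s within_budget={ok}")
```

The helper reports an overrun rather than killing the process, which fits the framing of the caps as heuristics: a run that creeps past the budget is evidence for reopening the ADR, not a failure in itself.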
Extension condition for revisit
If scope extends to production deployment (currently out-of-scope per the user reframing at decision time), add the production-grade test tier via superseding ADR — likely Hypothesis property tests on project-specific math (if any exists at that point) and golden-output snapshots for the canonical headline JSON shape.
Alternatives considered
- (A) Single profile-switched target — rejected; hides cost asymmetry; typo-risk.
- (B-as-stated) Two distinct targets without integration middle — superseded by stratified-by-execution-context variant (this ADR); the integration tier is genuinely useful for both local debugging and cloud pre-flight.
- (C) Three tiers with full-cpu middle: rejected; YAGNI; reference-rung-only reproduction is ad-hoc, not a contract surface.
- Rerun upstream math tests in this repo: explicitly rejected per the user reframing; math rigor lives upstream where the math lives.