Test coverage floor — 70% flat with co-locked upstream-issue-filing discipline

Published

May 16, 2026

ADR-028: Test coverage floor — 70% flat with co-locked upstream-issue-filing discipline

Status

Accepted (2026-05-16). Closes the third of 4 [OPEN] rows in Phase 0-06 (§5 Code architecture + §STYLE — rows 348-351 of SPEC_GREENFIELD ledger). Companion to ADR-026 (module layout), ADR-027 (smoke vs canonical), and ADR-029 (test marker strategy).

Context

pytest --cov measures source-line coverage from the test suite; a coverage floor is a CI gate that fails the build if coverage drops below threshold. STYLE.md’s “project deltas” section explicitly defers this decision to Phase 0:

Test coverage floor: [OPEN: coverage floor; resolved at Phase 0]. eval-toolkit uses 90%; case-study composition layer typically doesn’t need foundational-library rigor.

Five options were considered:

90% — eval-toolkit parity. Contradicts the prototype-grade framing locked by ADR-027; forces low-value tests against glue.
80% — middle ground. Same low-value-test pressure as 90%, just delayed.
70% — relaxed; matches STYLE.md hint. Low enough to accommodate orchestration glue, high enough to signal that critical paths get tested.
No formal floor; measure-only — zero false positives, but no gate means coverage decay is invisible until writeup time.
Stratified floor (70% on src/, no floor on scripts/) — most pragmatic version of C; aligns with the src/-vs-scripts/ boundary locked by ADR-026.

User feedback at decision time selected option C with an additional process commitment:

“C, but whenever we find things that should be sent to runpod-deploy or eval-toolkit we should send issues to those repos for those tests.”

This addendum extends the existing library-first discipline (don’t hand-roll library primitives — file upstream issues for gaps; tracked in decisions/upstream_issues.md) from library-primitive-gaps to test-coverage-gaps. Rather than carving out scripts/ via stratification (option E), the simpler 70% flat threshold is paired with a process commitment: when a coverage gap is identified that would be better addressed upstream, file the gap upstream rather than write a low-value local test.

This is structurally cleaner than option E because:

The floor calculation is a single CLI flag (--cov-fail-under=70) — no per-package strata to maintain in .coveragerc.
The escape hatch (upstream filing) is a real process improvement — generates contribution-trail value for the writeup, not just a technical workaround.
It’s load-bearing in the right direction: the floor is a forcing function for honest engagement with each gap (either test locally, file upstream, or document non-applicability), not a license to add anti-tests.

Decision

Locked threshold and CI command

Coverage floor: 70% flat across the repo.

CI command:

uv run pytest --cov --cov-fail-under=70 --cov-report=term-missing

Updated Makefile coverage target (replaces existing ungated form pytest --cov=. --cov-report=term-missing):

coverage:
    uv run pytest --cov --cov-fail-under=70 --cov-report=term-missing

Co-locked process commitment — upstream-issue-filing for test-coverage gaps

When a local coverage gap is identified during Phase 1+ work, the developer applies this triage:

Local test is the right home — write the test locally in tests/unit/, tests/smoke/, or tests/integration/ per the Q4 marker taxonomy (ADR-029).
Upstream library is the right home — file an issue at the upstream repo (brandon-behring/eval-toolkit or brandon-behring/runpod-deploy) with:
- The proposed test pattern (sketch, not implementation).
- The rationale (why the test logic belongs upstream, not locally).
- Local file:line that depends on the absent test (if any).
- tracked label. Then add a row to decisions/upstream_issues.md with the issue URL + [test-coverage-gap] tag + local file:line.
Genuinely not testable, not upstream-applicable — leave a # noqa: COV style code comment with rationale + add a decisions/upstream_issues.md row tagged [not-applicable] documenting the deferral.

A workaround that ignores the gap (lets coverage drop below 70% silently, or worse, adds a no-op test to inflate coverage) is an anti-pattern.

Updated `decisions/upstream_issues.md` ledger conventions

The ledger’s “How to use this ledger” section gains:

Test-coverage-gap entries: when a coverage gap is filed upstream rather than tested locally, add a row with the [test-coverage-gap] tag in the Title column. When a gap is documented as not-applicable rather than filed, use the [not-applicable] tag. Both forms preserve the discipline trail without forcing local anti-tests.

A worked example row (placeholder — populated as actual gaps surface during Phase 1+):

Date	Repo	Issue #	Title	Local site (file:line)	Status
`[TBD-at-Phase-1-entry]`	`brandon-behring/eval-toolkit`	`[TBD]`	`[test-coverage-gap]` example placeholder for upstream-filing convention	`[TBD]`	placeholder

What’s explicitly out of scope

Per-package coverage strata (option E) — kept simple at 70% flat; the upstream-filing escape hatch handles the orchestration-glue case structurally rather than via configuration.
Production-grade coverage (≥85%) — currently out-of-scope per ADR-027 prototype-grade framing; reopen via superseding ADR if scope extends.
Coverage on tests/ itself — pytest-cov’s default behavior excludes test files when --cov has no path argument; this is preserved.

Consequences

Positive

Single CLI flag for the floor — no .coveragerc per-package configuration to drift.
Floor as forcing function for honest engagement: every gap gets one of three responses (local test / upstream issue / documented deferral). No silent coverage decay.
Contribution trail: upstream-filed test-coverage gaps generate decisions/upstream_issues.md entries that double as evidence of library-first discipline for the writeup.
Aligns with ADR-027 prototype-grade framing: 70% is the right number for debugging-grade testing; 90% would force the layer to do work that belongs upstream.

Negative

70% is empirically chosen, not derived. If Phase 1 reveals it’s either trivially exceeded everywhere (suggesting we should raise it) or chronically failed on legitimate glue (suggesting we should rethink), reopen via ADR with the actual data.
Upstream filing has friction: writing the issue + the proposed test pattern takes more developer time than writing a local anti-test. This is the intended friction — it forces a real “is this our problem or theirs” judgment rather than a reflex anti-test.
CI gate fires on coverage drops even when the absolute level is still above 70%. Acceptable cost — drops are signal worth investigating; the upstream-filing escape hatch handles the legitimate-deferral case.

Limitation

The 70% threshold is a heuristic, not a methodological commitment. It is calibrated for the prototype-grade context locked by ADR-027 (where the math lives upstream). If a Phase 1+ surprise reveals the threshold is mis-calibrated, the data justifies the superseding ADR — not a quiet adjustment.

Extension condition for revisit

Production-deployment scope extension lifts floor to 85% with src/eval/ at 90% (since src/eval/ is the most math-adjacent package and would carry production-grade calibration / threshold-fitting orchestration logic in that context).
First sustained Phase 1 violation of the floor with no viable upstream home triggers a re-evaluation — possibly drop to 60% (with explanation), or move to the stratified option E variant.

Alternatives considered

(A) 90% — rejected; contradicts prototype-grade framing; forces low-value tests.
(B) 80% — rejected; arbitrary middle ground with same low-value-test pressure delayed.
(D) No floor; measure-only — rejected; loses the forcing-function benefit; coverage decay invisible until writeup.
(E) Stratified 70% on src/, no floor on scripts/ — rejected in favor of flat 70% + upstream-filing escape hatch; the escape hatch is structurally cleaner than per-package strata and provides real process value.