Test coverage floor — 70% flat with co-locked upstream-issue-filing discipline
ADR-028: Test coverage floor — 70% flat with co-locked upstream-issue-filing discipline
Status
Accepted (2026-05-16). Closes the third of 4 [OPEN] rows in Phase 0-06 (§5 Code architecture + §STYLE — rows 348-351 of SPEC_GREENFIELD ledger). Companion to ADR-026 (module layout), ADR-027 (smoke vs canonical), and ADR-029 (test marker strategy).
Context
pytest --cov measures source-line coverage from the test suite; a coverage floor is a CI gate that fails the build if coverage drops below threshold. STYLE.md’s “project deltas” section explicitly defers this decision to Phase 0:
Test coverage floor:
[OPEN: coverage floor; resolved at Phase 0]. eval-toolkit uses 90%; case-study composition layer typically doesn’t need foundational-library rigor.
Five options were considered:
90% — eval-toolkit parity. Contradicts the prototype-grade framing locked by ADR-027; forces low-value tests against glue.
80% — middle ground. Same low-value-test pressure as 90%, just delayed.
70% — relaxed; matches STYLE.md hint. Low enough to accommodate orchestration glue, high enough to signal that critical paths get tested.
No formal floor; measure-only — zero false positives, but no gate means coverage decay is invisible until writeup time.
Stratified floor (70% on src/, no floor on scripts/) — most pragmatic version of C; aligns with the src/-vs-scripts/ boundary locked by ADR-026.
User feedback at decision time selected option C with an additional process commitment:
“C, but whenever we find things that should be sent to runpod-deploy or eval-toolkit we should send issues to those repos for those tests.”
This addendum extends the existing library-first discipline (don’t hand-roll library primitives — file upstream issues for gaps; tracked in decisions/upstream_issues.md) from library-primitive-gaps to test-coverage-gaps. Rather than carving out scripts/ via stratification (option E), the simpler 70% flat threshold is paired with a process commitment: when a coverage gap is identified that would be better addressed upstream, file the gap upstream rather than write a low-value local test.
This is structurally cleaner than option E because:
- The floor calculation is a single CLI flag (
--cov-fail-under=70) — no per-package strata to maintain in.coveragerc. - The escape hatch (upstream filing) is a real process improvement — generates contribution-trail value for the writeup, not just a technical workaround.
- It’s load-bearing in the right direction: the floor is a forcing function for honest engagement with each gap (either test locally, file upstream, or document non-applicability), not a license to add anti-tests.
Decision
Locked threshold and CI command
Coverage floor: 70% flat across the repo.
CI command:
uv run pytest --cov --cov-fail-under=70 --cov-report=term-missingUpdated Makefile coverage target (replaces existing ungated form pytest --cov=. --cov-report=term-missing):
coverage:
uv run pytest --cov --cov-fail-under=70 --cov-report=term-missingCo-locked process commitment — upstream-issue-filing for test-coverage gaps
When a local coverage gap is identified during Phase 1+ work, the developer applies this triage:
- Local test is the right home — write the test locally in
tests/unit/,tests/smoke/, ortests/integration/per the Q4 marker taxonomy (ADR-029). - Upstream library is the right home — file an issue at the upstream repo (
brandon-behring/eval-toolkitorbrandon-behring/runpod-deploy) with:- The proposed test pattern (sketch, not implementation).
- The rationale (why the test logic belongs upstream, not locally).
- Local file:line that depends on the absent test (if any).
trackedlabel. Then add a row todecisions/upstream_issues.mdwith the issue URL +[test-coverage-gap]tag + local file:line.
- Genuinely not testable, not upstream-applicable — leave a
# noqa: COVstyle code comment with rationale + add adecisions/upstream_issues.mdrow tagged[not-applicable]documenting the deferral.
A workaround that ignores the gap (lets coverage drop below 70% silently, or worse, adds a no-op test to inflate coverage) is an anti-pattern.
Updated decisions/upstream_issues.md ledger conventions
The ledger’s “How to use this ledger” section gains:
Test-coverage-gap entries: when a coverage gap is filed upstream rather than tested locally, add a row with the
[test-coverage-gap]tag in the Title column. When a gap is documented as not-applicable rather than filed, use the[not-applicable]tag. Both forms preserve the discipline trail without forcing local anti-tests.
A worked example row (placeholder — populated as actual gaps surface during Phase 1+):
| Date | Repo | Issue # | Title | Local site (file:line) | Status |
|---|---|---|---|---|---|
[TBD-at-Phase-1-entry] |
brandon-behring/eval-toolkit |
[TBD] |
[test-coverage-gap] example placeholder for upstream-filing convention |
[TBD] |
placeholder |
What’s explicitly out of scope
- Per-package coverage strata (option E) — kept simple at 70% flat; the upstream-filing escape hatch handles the orchestration-glue case structurally rather than via configuration.
- Production-grade coverage (≥85%) — currently out-of-scope per ADR-027 prototype-grade framing; reopen via superseding ADR if scope extends.
- Coverage on
tests/itself — pytest-cov’s default behavior excludes test files when--covhas no path argument; this is preserved.
Consequences
Positive
- Single CLI flag for the floor — no
.coveragercper-package configuration to drift. - Floor as forcing function for honest engagement: every gap gets one of three responses (local test / upstream issue / documented deferral). No silent coverage decay.
- Contribution trail: upstream-filed test-coverage gaps generate
decisions/upstream_issues.mdentries that double as evidence of library-first discipline for the writeup. - Aligns with ADR-027 prototype-grade framing: 70% is the right number for debugging-grade testing; 90% would force the layer to do work that belongs upstream.
Negative
- 70% is empirically chosen, not derived. If Phase 1 reveals it’s either trivially exceeded everywhere (suggesting we should raise it) or chronically failed on legitimate glue (suggesting we should rethink), reopen via ADR with the actual data.
- Upstream filing has friction: writing the issue + the proposed test pattern takes more developer time than writing a local anti-test. This is the intended friction — it forces a real “is this our problem or theirs” judgment rather than a reflex anti-test.
- CI gate fires on coverage drops even when the absolute level is still above 70%. Acceptable cost — drops are signal worth investigating; the upstream-filing escape hatch handles the legitimate-deferral case.
Limitation
The 70% threshold is a heuristic, not a methodological commitment. It is calibrated for the prototype-grade context locked by ADR-027 (where the math lives upstream). If a Phase 1+ surprise reveals the threshold is mis-calibrated, the data justifies the superseding ADR — not a quiet adjustment.
Extension condition for revisit
- Production-deployment scope extension lifts floor to 85% with
src/eval/at 90% (sincesrc/eval/is the most math-adjacent package and would carry production-grade calibration / threshold-fitting orchestration logic in that context). - First sustained Phase 1 violation of the floor with no viable upstream home triggers a re-evaluation — possibly drop to 60% (with explanation), or move to the stratified option E variant.
Alternatives considered
- (A) 90% — rejected; contradicts prototype-grade framing; forces low-value tests.
- (B) 80% — rejected; arbitrary middle ground with same low-value-test pressure delayed.
- (D) No floor; measure-only — rejected; loses the forcing-function benefit; coverage decay invisible until writeup.
- (E) Stratified 70% on src/, no floor on scripts/ — rejected in favor of flat 70% + upstream-filing escape hatch; the escape hatch is structurally cleaner than per-package strata and provides real process value.