Compute infrastructure and cost discipline — runpod-deploy 0.7.7 primitives, GPU failover, adaptive batch sizing, dual-layer cost tracking
ADR-020: Compute infrastructure and cost discipline
Status
Accepted (2026-05-15). Concretizes A-001 (runpod-deploy infrastructure assumption) and A-002 (compute budget envelope) into specific runpod-deploy 0.7.7 configuration. Complements ADR-013 (pre-teardown persistence checklist) and ADR-019 (training recipe).
Context
A-001 originally framed runpod-deploy as the GPU-rental orchestration substrate; A-002 framed the budget at $25-$125 envelope. Phase 0-03 Q8 concretized both into a specific runpod-deploy 0.7.7 configuration.
Brandon’s Phase 0-03 Q8 surfaced two concerns that the original ADR-001 plus A-002 framing did not address.
First, H100 availability fluctuates on RunPod spot — pinning to a single GPU class risks unavailability mid-Phase-2. The runpod-deploy primitive pod.gpu_order is an ordered failover list — runpod-deploy walks the list until it finds an available GPU. The 8-class failover ladder ensures we get a usable GPU even under H100 stockout.
Second, batch sizing must adapt to whatever GPU class the failover ladder lands us on. Naive single-batch-config approach risks OOM on smaller-VRAM classes (A100 40GB, L40S 48GB, L40 24GB). The methodology-aligned move is to preserve the ADR-019 effective batch invariant (32) while scaling per_device plus grad_accum together — preserves the gradient-computation hyperparameter (effective batch) while adapting throughput to memory.
Brandon also flagged that we should leverage the runpod-deploy primitives rather than hand-rolling equivalents. The Phase 0-03 walk inspected the runpod-deploy 0.7.7 source plus recipes plus reference configs and confirmed — gpu_order failover plus datacenter failover plus budget.cost_cap_usd plus preflight.check_gpu_availability plus flash-attention-fallback recipe plus cost-reconciliation recipe plus per-pod manifest capture all exist as documented primitives. The library-first discipline from CLAUDE.md plus SPEC §Tech-Stack means we invoke these primitives correctly rather than rebuild them.
Decision
GPU failover ladder (pod.gpu_order)
Tier 1 — 80GB, bf16 native, flash-attention-2 native — H100 family plus H200 family. - NVIDIA H100 80GB HBM3 - NVIDIA H100 NVL - NVIDIA H100 SXM - NVIDIA H100 PCIe - NVIDIA H200 - NVIDIA H200 NVL
Tier 2 — 80GB, bf16 native, flash-attention-2 native, ~50% H100 throughput — A100 80GB. - NVIDIA A100-SXM4-80GB - NVIDIA A100 80GB PCIe
Tier 3 — 48GB, bf16 native, flash-attention-2 with fallback — L40S. - NVIDIA L40S
Tier 4 (emergency fallback) — 40GB, bf16 OK, may need flash-attn fallback. - NVIDIA A100-SXM4-40GB
Avoid (no bf16, V100 / Pascal / Turing) plus avoid (too small for max_len=8192 batch — RTX 4000 / A4000 24GB).
Datacenter failover (pod.datacenters)
[US-MD-1, EU-RO-1] — dual-DC capacity resilience matching the prompt-injection-v3 reference config plus the runpod-deploy config-reference.md documented pattern.
Adaptive batch sizing (BATCH_TABLE)
Pre-locked lookup table keyed on detected GPU class (torch.cuda.get_device_name) preserving effective batch equals 32 across all classes.
BATCH_TABLE = {
"H100": {"per_device": 16, "grad_accum": 2},
"H200": {"per_device": 16, "grad_accum": 2},
"A100-80G": {"per_device": 16, "grad_accum": 2},
"A100-40G": {"per_device": 8, "grad_accum": 4},
"L40S": {"per_device": 8, "grad_accum": 4},
"L40": {"per_device": 4, "grad_accum": 8},
}
gpu_name = torch.cuda.get_device_name(0)
gpu_class = classify_gpu(gpu_name) # maps device name -> table key
try:
cfg = BATCH_TABLE[gpu_class]
except KeyError:
raise RuntimeError(
f"GPU class {gpu_class!r} not in BATCH_TABLE; add explicit entry"
f" (gpu_name={gpu_name!r})"
)NOT hyperparameter tuning — effective batch (the actual gradient-computation hyperparameter) is held constant at 32 across all GPU classes; per_device + grad_accum are throughput knobs that do not change the gradient computation. Preserves SPEC §2 hyperparameter-immutability invariant.
If runpod-deploy lands us on an unlisted GPU (e.g., a new H300 class), the script fails loudly rather than silently defaulting — add explicit BATCH_TABLE entry plus update this ADR via superseding ADR if needed.
flash_attention_2 fallback (per runpod-deploy recipe)
import torch
from transformers import AutoModelForSequenceClassification
try:
model = AutoModelForSequenceClassification.from_pretrained(
"answerdotai/ModernBERT-base",
revision=SHA_PIN, # per ADR-016 + ADR-018 manifest
attn_implementation="flash_attention_2",
torch_dtype=torch.bfloat16,
num_labels=2,
)
except (ValueError, ImportError):
model = AutoModelForSequenceClassification.from_pretrained(
"answerdotai/ModernBERT-base",
revision=SHA_PIN,
torch_dtype=torch.bfloat16,
num_labels=2,
)
# Log fallback via runpod-deploy events.emit_event so audit trail captures
# which physical config produced each per-row prediction
log_event("flash_attn_fallback", gpu=torch.cuda.get_device_name(0))Cost cap (dual-layer)
Layer 1 — per-job orchestrator enforcement.
budget:
cost_cap_usd: 125.0 # = A-002 soft cap; one bad pod cannot exceed
assumed_hourly_rate_usd: 3.50 # H100 spot midpoint; reconciled post-first-run
poll_interval_sec: 60Layer 2 — project-wide hard cap (CI-gated, ledger-driven).
# scripts/cost_rollup.py walks artifacts/runpod/*/runpod_deploy_pull_manifest.json
# plus API call logs and emits evals/cost_ledger.csv with cumulative spend
# Fails CI if cumulative > $200Cost ledger schema — evals/cost_ledger.csv columns — timestamp, target, est_cost_usd, actual_cost_usd, gpu_hours, api_calls, notes.
Trigger thresholds — cumulative $80 (~64% of soft cap) flags for review; cumulative $125 (soft cap) escalation discussion documented in evals/cost_decisions.md; cumulative $200 (hard cap) STOP plus superseding ADR required.
Preflight discipline
Before any billed run:
runpod-deploy validate --config configs/runpod/headline.yaml --all
runpod-deploy run --config configs/runpod/headline.yaml --dry-runvalidate --all runs schema validation plus DC reachability plus GPU stock check. run --dry-run hits runpodctl + GraphQL pricing without provisioning anything; tells us “would this find a GPU at this price right now?”
Library imports ledger update
decisions/library_imports.md runpod-deploy section populated with eight primitives we invoke.
Consequences
Positive
- Library-first discipline preserved — runpod-deploy primitives invoked end-to-end; no hand-rolled GPU orchestration or cost tracking
- GPU availability handled by orchestrator failover; no project code required to handle H100 stockouts
- Adaptive batch sizing preserves the ADR-019 effective-batch invariant across all GPU classes — methodology integrity preserved under GPU substitution
- flash_attention_2 fallback handles cross-GPU-class portability without code failure on smaller GPUs
- Dual-layer cost cap — per-job orchestrator enforcement plus project-wide CI-gated rollup — prevents both single-pod runaway cost AND cumulative drift
- Per-pod cost reconciliation is automatic via the runpod-deploy manifest schema; no extra code
- assumed_hourly_rate_usd reconciliation pattern (per cost-reconciliation recipe) tunes the implicit timeout after first run
Negative
- The 8-class failover ladder means we may land on slower hardware than H100; wall-clock may extend up to ~2x on A100 80GB or L40S; still inside the cost cap because spot pricing is correspondingly lower
- Adaptive batch sizing adds modest code complexity (BATCH_TABLE lookup plus GPU classifier function); manageable
- Per-GPU-class throughput varies — for cross-rung wall-clock comparison reporting, the GPU class is captured in per-pod manifest as a confounder to acknowledge in the methodology spoke
- A new GPU class outside the BATCH_TABLE causes fail-loud (intended); requires a superseding ADR if we want to extend the table without a full Phase 0 cycle
- assumed_hourly_rate_usd=$3.50 is an estimate; if actual H100 spot averages much higher (e.g., $4.50), the implicit timeout (cost_cap / assumed_rate) shrinks; reconciliation after first run catches this
Phase 1 deliverables
- configs/runpod/headline.yaml — locked pod.gpu_order + pod.datacenters + budget
- src/training/batch_table.py — BATCH_TABLE + classify_gpu + fail-loud KeyError
- src/training/load_modernbert.py — flash-attn-2 with SDPA fallback
- scripts/cost_rollup.py — per-pod manifest aggregation plus CI hard-cap gate
- evals/cost_ledger.csv — populated per Makefile-target invocation
- decisions/library_imports.md — runpod-deploy section populated with eight primitives
Alternatives considered
- Single GPU class (H100 only) — rejected because H100 spot availability fluctuates; pinning to a single class risks unavailability mid-Phase-2. The 8-class failover is the runpod-deploy idiom and handles stockouts transparently.
- Single datacenter (US-MD-1 only) — rejected for the same reason; dual-DC failover is the runpod-deploy idiom plus the prompt-injection-v3 reference config pattern.
- auto_find_batch_size=True via HF Trainer — rejected because adds nondeterminism (binary search varies per run) plus ~30 second warmup per run. Pre-locked BATCH_TABLE is deterministic plus auditable.
- GPU class as a config parameter (split into per-GPU-class configs) — rejected because proliferates configs (one per GPU class times one per rung); the runtime-detected approach is simpler and matches runpod-deploy’s failover semantics.
- flash_attention_2 required (fail-loud on unsupported) — rejected because the failover ladder may legitimately land us on smaller GPUs without flash-attn-2 support; the recipe’s degrade-to-SDPA pattern is the runpod-deploy idiom for this case.
- Single-layer cost cap (orchestrator-only via budget.cost_cap_usd) — rejected because budget.cost_cap_usd is per-pod enforcement; without the project-wide rollup, cumulative spend across many pods could drift above the A-002 envelope. The dual-layer approach handles both runaway-pod and cumulative-drift cases.
- Hand-rolled GPU orchestration — rejected per CLAUDE.md library-first discipline; runpod-deploy 0.7.7 has all the primitives we need.
References
See frontmatter references list. Primary anchors — runpod-deploy 0.7.7 source at https://github.com/brandon-behring/runpod-deploy (per ADR-068 narrow-relaxation: local-fs author-machine path canonicalized to upstream URL); RunPod pricing page; NVIDIA H100 tensor-core whitepaper; OpenAI pricing; Anthropic model card pricing; HuggingFace TrainingArguments docs; ADR-001 deadline-plus-scope-ambition framing; ADR-013 pre-teardown persistence checklist; ADR-019 training recipe for batch-size context.