Compute infrastructure and cost discipline — runpod-deploy 0.7.7 primitives, GPU failover, adaptive batch sizing, dual-layer cost tracking

Published

May 15, 2026

ADR-020: Compute infrastructure and cost discipline

Status

Accepted (2026-05-15). Concretizes A-001 (runpod-deploy infrastructure assumption) and A-002 (compute budget envelope) into specific runpod-deploy 0.7.7 configuration. Complements ADR-013 (pre-teardown persistence checklist) and ADR-019 (training recipe).

Context

A-001 originally framed runpod-deploy as the GPU-rental orchestration substrate; A-002 framed the budget at $25-$125 envelope. Phase 0-03 Q8 concretized both into a specific runpod-deploy 0.7.7 configuration.

Brandon’s Phase 0-03 Q8 surfaced two concerns that the original ADR-001 plus A-002 framing did not address.

First, H100 availability fluctuates on RunPod spot — pinning to a single GPU class risks unavailability mid-Phase-2. The runpod-deploy primitive pod.gpu_order is an ordered failover list — runpod-deploy walks the list until it finds an available GPU. The 8-class failover ladder ensures we get a usable GPU even under H100 stockout.

Second, batch sizing must adapt to whatever GPU class the failover ladder lands us on. Naive single-batch-config approach risks OOM on smaller-VRAM classes (A100 40GB, L40S 48GB, L40 24GB). The methodology-aligned move is to preserve the ADR-019 effective batch invariant (32) while scaling per_device plus grad_accum together — preserves the gradient-computation hyperparameter (effective batch) while adapting throughput to memory.

Brandon also flagged that we should leverage the runpod-deploy primitives rather than hand-rolling equivalents. The Phase 0-03 walk inspected the runpod-deploy 0.7.7 source plus recipes plus reference configs and confirmed — gpu_order failover plus datacenter failover plus budget.cost_cap_usd plus preflight.check_gpu_availability plus flash-attention-fallback recipe plus cost-reconciliation recipe plus per-pod manifest capture all exist as documented primitives. The library-first discipline from CLAUDE.md plus SPEC §Tech-Stack means we invoke these primitives correctly rather than rebuild them.

Decision

GPU failover ladder (pod.gpu_order)

Tier 1 — 80GB, bf16 native, flash-attention-2 native — H100 family plus H200 family. - NVIDIA H100 80GB HBM3 - NVIDIA H100 NVL - NVIDIA H100 SXM - NVIDIA H100 PCIe - NVIDIA H200 - NVIDIA H200 NVL

Tier 2 — 80GB, bf16 native, flash-attention-2 native, ~50% H100 throughput — A100 80GB. - NVIDIA A100-SXM4-80GB - NVIDIA A100 80GB PCIe

Tier 3 — 48GB, bf16 native, flash-attention-2 with fallback — L40S. - NVIDIA L40S

Tier 4 (emergency fallback) — 40GB, bf16 OK, may need flash-attn fallback. - NVIDIA A100-SXM4-40GB

Avoid (no bf16, V100 / Pascal / Turing) plus avoid (too small for max_len=8192 batch — RTX 4000 / A4000 24GB).

Datacenter failover (pod.datacenters)

[US-MD-1, EU-RO-1] — dual-DC capacity resilience matching the prompt-injection-v3 reference config plus the runpod-deploy config-reference.md documented pattern.

Adaptive batch sizing (BATCH_TABLE)

Pre-locked lookup table keyed on detected GPU class (torch.cuda.get_device_name) preserving effective batch equals 32 across all classes.

BATCH_TABLE = {
    "H100":     {"per_device": 16, "grad_accum": 2},
    "H200":     {"per_device": 16, "grad_accum": 2},
    "A100-80G": {"per_device": 16, "grad_accum": 2},
    "A100-40G": {"per_device":  8, "grad_accum": 4},
    "L40S":     {"per_device":  8, "grad_accum": 4},
    "L40":      {"per_device":  4, "grad_accum": 8},
}

gpu_name = torch.cuda.get_device_name(0)
gpu_class = classify_gpu(gpu_name)  # maps device name -> table key
try:
    cfg = BATCH_TABLE[gpu_class]
except KeyError:
    raise RuntimeError(
        f"GPU class {gpu_class!r} not in BATCH_TABLE; add explicit entry"
        f" (gpu_name={gpu_name!r})"
    )

NOT hyperparameter tuning — effective batch (the actual gradient-computation hyperparameter) is held constant at 32 across all GPU classes; per_device + grad_accum are throughput knobs that do not change the gradient computation. Preserves SPEC §2 hyperparameter-immutability invariant.

If runpod-deploy lands us on an unlisted GPU (e.g., a new H300 class), the script fails loudly rather than silently defaulting — add explicit BATCH_TABLE entry plus update this ADR via superseding ADR if needed.

flash_attention_2 fallback (per runpod-deploy recipe)

import torch
from transformers import AutoModelForSequenceClassification

try:
    model = AutoModelForSequenceClassification.from_pretrained(
        "answerdotai/ModernBERT-base",
        revision=SHA_PIN,  # per ADR-016 + ADR-018 manifest
        attn_implementation="flash_attention_2",
        torch_dtype=torch.bfloat16,
        num_labels=2,
    )
except (ValueError, ImportError):
    model = AutoModelForSequenceClassification.from_pretrained(
        "answerdotai/ModernBERT-base",
        revision=SHA_PIN,
        torch_dtype=torch.bfloat16,
        num_labels=2,
    )
    # Log fallback via runpod-deploy events.emit_event so audit trail captures
    # which physical config produced each per-row prediction
    log_event("flash_attn_fallback", gpu=torch.cuda.get_device_name(0))

Cost cap (dual-layer)

Layer 1 — per-job orchestrator enforcement.

budget:
  cost_cap_usd: 125.0                # = A-002 soft cap; one bad pod cannot exceed
  assumed_hourly_rate_usd: 3.50      # H100 spot midpoint; reconciled post-first-run
  poll_interval_sec: 60

Layer 2 — project-wide hard cap (CI-gated, ledger-driven).

# scripts/cost_rollup.py walks artifacts/runpod/*/runpod_deploy_pull_manifest.json
# plus API call logs and emits evals/cost_ledger.csv with cumulative spend
# Fails CI if cumulative > $200

Cost ledger schema — evals/cost_ledger.csv columns — timestamp, target, est_cost_usd, actual_cost_usd, gpu_hours, api_calls, notes.

Trigger thresholds — cumulative $80 (~64% of soft cap) flags for review; cumulative $125 (soft cap) escalation discussion documented in evals/cost_decisions.md; cumulative $200 (hard cap) STOP plus superseding ADR required.

Preflight discipline

Before any billed run:

runpod-deploy validate --config configs/runpod/headline.yaml --all
runpod-deploy run --config configs/runpod/headline.yaml --dry-run

validate --all runs schema validation plus DC reachability plus GPU stock check. run --dry-run hits runpodctl + GraphQL pricing without provisioning anything; tells us “would this find a GPU at this price right now?”

Library imports ledger update

decisions/library_imports.md runpod-deploy section populated with eight primitives we invoke.

Consequences

Positive

  • Library-first discipline preserved — runpod-deploy primitives invoked end-to-end; no hand-rolled GPU orchestration or cost tracking
  • GPU availability handled by orchestrator failover; no project code required to handle H100 stockouts
  • Adaptive batch sizing preserves the ADR-019 effective-batch invariant across all GPU classes — methodology integrity preserved under GPU substitution
  • flash_attention_2 fallback handles cross-GPU-class portability without code failure on smaller GPUs
  • Dual-layer cost cap — per-job orchestrator enforcement plus project-wide CI-gated rollup — prevents both single-pod runaway cost AND cumulative drift
  • Per-pod cost reconciliation is automatic via the runpod-deploy manifest schema; no extra code
  • assumed_hourly_rate_usd reconciliation pattern (per cost-reconciliation recipe) tunes the implicit timeout after first run

Negative

  • The 8-class failover ladder means we may land on slower hardware than H100; wall-clock may extend up to ~2x on A100 80GB or L40S; still inside the cost cap because spot pricing is correspondingly lower
  • Adaptive batch sizing adds modest code complexity (BATCH_TABLE lookup plus GPU classifier function); manageable
  • Per-GPU-class throughput varies — for cross-rung wall-clock comparison reporting, the GPU class is captured in per-pod manifest as a confounder to acknowledge in the methodology spoke
  • A new GPU class outside the BATCH_TABLE causes fail-loud (intended); requires a superseding ADR if we want to extend the table without a full Phase 0 cycle
  • assumed_hourly_rate_usd=$3.50 is an estimate; if actual H100 spot averages much higher (e.g., $4.50), the implicit timeout (cost_cap / assumed_rate) shrinks; reconciliation after first run catches this

Phase 1 deliverables

  • configs/runpod/headline.yaml — locked pod.gpu_order + pod.datacenters + budget
  • src/training/batch_table.py — BATCH_TABLE + classify_gpu + fail-loud KeyError
  • src/training/load_modernbert.py — flash-attn-2 with SDPA fallback
  • scripts/cost_rollup.py — per-pod manifest aggregation plus CI hard-cap gate
  • evals/cost_ledger.csv — populated per Makefile-target invocation
  • decisions/library_imports.md — runpod-deploy section populated with eight primitives

Alternatives considered

  • Single GPU class (H100 only) — rejected because H100 spot availability fluctuates; pinning to a single class risks unavailability mid-Phase-2. The 8-class failover is the runpod-deploy idiom and handles stockouts transparently.
  • Single datacenter (US-MD-1 only) — rejected for the same reason; dual-DC failover is the runpod-deploy idiom plus the prompt-injection-v3 reference config pattern.
  • auto_find_batch_size=True via HF Trainer — rejected because adds nondeterminism (binary search varies per run) plus ~30 second warmup per run. Pre-locked BATCH_TABLE is deterministic plus auditable.
  • GPU class as a config parameter (split into per-GPU-class configs) — rejected because proliferates configs (one per GPU class times one per rung); the runtime-detected approach is simpler and matches runpod-deploy’s failover semantics.
  • flash_attention_2 required (fail-loud on unsupported) — rejected because the failover ladder may legitimately land us on smaller GPUs without flash-attn-2 support; the recipe’s degrade-to-SDPA pattern is the runpod-deploy idiom for this case.
  • Single-layer cost cap (orchestrator-only via budget.cost_cap_usd) — rejected because budget.cost_cap_usd is per-pod enforcement; without the project-wide rollup, cumulative spend across many pods could drift above the A-002 envelope. The dual-layer approach handles both runaway-pod and cumulative-drift cases.
  • Hand-rolled GPU orchestration — rejected per CLAUDE.md library-first discipline; runpod-deploy 0.7.7 has all the primitives we need.

References

See frontmatter references list. Primary anchors — runpod-deploy 0.7.7 source at https://github.com/brandon-behring/runpod-deploy (per ADR-068 narrow-relaxation: local-fs author-machine path canonicalized to upstream URL); RunPod pricing page; NVIDIA H100 tensor-core whitepaper; OpenAI pricing; Anthropic model card pricing; HuggingFace TrainingArguments docs; ADR-001 deadline-plus-scope-ambition framing; ADR-013 pre-teardown persistence checklist; ADR-019 training recipe for batch-size context.