GPU failover ladder priority refresh — A100-SXM4-80GB to position 1 + US-WA-1/US-CA-2 datacenter additions + EU-RO-1 drop (frozen-probe + LoRA canonical; full-FT deferred to post-rehearsal)
ADR-049: GPU failover ladder priority refresh — A100-80G to position 1
Status
Accepted (2026-05-17). Supersedes ADR-020 §“GPU failover ladder (pod.gpu_order)” ONLY on the position-order axis; the SET of GPU SKUs + the four-tier structure + dual-layer cost cap discipline + BATCH_TABLE invariant from ADR-020 are unchanged and remain locked.
Context
ADR-020 locked the GPU failover ladder at Phase 0-03 with H100/H200 family as Tier 1 (positions 1-6), A100-80GB as Tier 2 (positions 7-8), L40S as Tier 3 (position 9), and A100-40GB as Tier 4 emergency (position 10). The 2-DC failover was [US-MD-1, EU-RO-1]. At the time of locking (2026-05-15) H100 spot availability on RunPod was assumed sufficient for the canonical Phase 2 runs.
At Phase 4 close (2026-05-17 00:30 UTC) the canonical-run readiness audit fired runpod-deploy gpu-list --datacenter <DC> --no-prices against 6 RunPod datacenters + runpod-deploy validate --config configs/runpod/headline-frozen_probe.yaml --all. Live findings:
Hopper / H200 stock: - H100 80GB HBM3: present in CA-MTL-1, EU-NL-1, US-CA-2 — all stockStatus="" (unavailable) - H100 NVL / SXM / PCIe: absent or unavailable across all probed DCs - H200 / H200 NVL: H200 NVL present in US-MD-1 + US-CA-2, both stockStatus=""
Ampere stock (the configured T2 fallback): - A100-SXM4-80GB: US-MD-1 Low, US-WA-1 Low, US-CA-2 "", EU-RO-1 "" - A100 80GB PCIe: EU-RO-1 only (and "")
L40S (T3): not in any DC I probed.
Net: validator output "[gpu:US-MD-1] all configured GPUs are Low stock — provisioning may fail" + "[gpu:EU-RO-1] no configured GPU available". The configured priority (H100/H200 at positions 1-6) is aspirational — none of those SKUs would actually provision; the orchestrator would walk the ladder until it hit A100-SXM4-80GB at position 7 in US-MD-1, with a Low-stock warning.
User direction 2026-05-17 via 3-question /AskUserQuestion session: 1. Full-FT plan → defer to post-rehearsal (fire after v0.9.0-rc1 dress-rehearsal passes cleanly; gated by follow-up ADR superseding ADR-049’s defer status) 2. ADR-049 scope → narrow (gpu_order priority + datacenters only; keep ADR-020 tier set + cap discipline + BATCH_TABLE intact) 3. Commit batching → all 3 commits now (X1 torch CUDA pin + X2 ADR-049 + X3 runpod-deploy to dev extras)
Decision
Refreshed pod.gpu_order (all 3 headline-* configs)
The SET is unchanged from ADR-020; the order is repositioned to align with 2026-05-17 stock reality:
| Position | SKU | Tier per ADR-020 | Stock as-of |
|---|---|---|---|
| 1 | NVIDIA A100-SXM4-80GB | T2 | US-MD-1 Low, US-WA-1 Low |
| 2 | NVIDIA A100 80GB PCIe | T2 | nominal fallback (EU-RO-1 dropped) |
| 3 | NVIDIA H100 80GB HBM3 | T1 | mid-run stock-recovery fallback |
| 4 | NVIDIA H100 NVL | T1 | ” |
| 5 | NVIDIA H100 SXM | T1 | ” |
| 6 | NVIDIA H100 PCIe | T1 | ” |
| 7 | NVIDIA H200 | T1 | ” |
| 8 | NVIDIA H200 NVL | T1 | ” |
| 9 | NVIDIA L40S | T3 | flash-attn-fallback per ADR-020 |
| 10 | NVIDIA A100-SXM4-40GB | T4 | emergency per ADR-020 |
T1 SKUs are kept in the ladder (positions 3-8) so that if RunPod stock recovers mid-Phase-2 the orchestrator naturally prefers them (the orchestrator re-runs preflight.check_gpu_availability per pod, not per Phase). The Tier structure from ADR-020 is intact; only the priority ordering moves.
Refreshed pod.datacenters (all 3 headline-* configs)
[US-MD-1, US-WA-1, US-CA-2] (was [US-MD-1, EU-RO-1])
- US-MD-1 kept as primary (A100-SXM4-80GB at Low stock; matches ADR-020 default).
- US-WA-1 added as secondary (A100-SXM4-80GB at Low stock; doubles provisioning success probability via DC-level failover).
- US-CA-2 added as nominal fallback (A100-SXM4-80GB present in DC but currently stockStatus=““; serves as a recovery candidate if stock returns mid-Phase-2).
- EU-RO-1 dropped (no A100-80G stock; no Hopper SKUs in DC at all; dead weight in the failover chain that would force the orchestrator to probe + reject before falling through to the actually-viable DCs).
Per-rung scope
| Rung | Canonical run this window? | Cap | Rationale |
|---|---|---|---|
| classical-floor | already runs locally | n/a | CPU-only per ADR-017; no GPU pod fires |
| frozen-probe | yes | $40 (unchanged) | cheapest GPU rung; ~1-2h wall on A100-80G; best canary |
| LoRA | yes | $60 (unchanged) | ~3-5h wall on A100-80G; intermediate-quality rung |
| full-FT | DEFERRED to post-rehearsal | $100 (unchanged) | ~6-12h wall on A100-80G; Low-stock + long-wall = mid-run preemption risk; fires after v0.9.0-rc1 dress-rehearsal passes cleanly per ADR-046 Q7 + ADR-033 |
The full-FT defer is documented in docs/ROADMAP.md Phase 4 close note rehearsal-tag dispatch checklist (new Step 7); a follow-up ADR will supersede ADR-049’s defer status when full-FT actually fires.
What ADR-020 retains (unchanged)
- The SET of 10 GPU SKUs in the ladder.
- The 4-tier structure (T1 80GB+bf16+flash-attn-2-native, T2 80GB+bf16+flash-attn-2-native+~50% H100 throughput, T3 48GB+bf16+flash-attn-fallback, T4 40GB+bf16+may-need-fallback).
- Adaptive batch sizing per BATCH_TABLE keyed on detected GPU class (ADR-020 §“Adaptive batch sizing”).
- Effective batch invariant (32) preserved across all GPU classes.
- flash-attn-2 fallback recipe in
src/training/load_modernbert.py. - Dual-layer cost cap discipline ($40/$60/$100 per-pod via
budget.cost_cap_usd; $200 total viascripts/cost_rollup.py --check). assumed_hourly_rate_usd: 3.50(still a reasonable Ampere/Hopper midpoint).- Preflight discipline (
runpod-deploy validate --all+runpod-deploy run --dry-runbefore any billed run). - All cost-tracking primitives + per-pod manifest capture + soft/hard cap thresholds + escalation policy.
Consequences
Positive:
runpod-deploy validate --allnow reports A100-SXM4-80GB at Low stock as the matched candidate instead of the misleading"all configured GPUs are Low stock — provisioning may fail"warning that fired pre-ADR-049.- Operator fires frozen-probe + LoRA with the actual provisionable GPU class as the priority candidate, not a phantom H100/H200 priority that wouldn’t provision.
- Datacenter expansion to 3 DCs (from 2) increases mid-run preemption resilience for the canonical runs.
- T1 SKUs kept in the ladder mean future stock recovery is captured automatically without another ADR refresh.
- Tier structure + cap discipline + BATCH_TABLE invariant from ADR-020 are unchanged — the methodology contract is preserved.
Negative / cost:
- Full-FT defer means Phase 5 WRITEUP draft (which begins post-rehearsal) cannot include full-FT numbers in the rehearsal-tag rendering; rehearsal site renders an incomplete 3-rung comparison.
- If full-FT exposes pipeline issues that frozen-probe + LoRA did not, the fix-forward cycle pushes v1.0.0 tag by one rc bump.
- Adding US-CA-2 as a fallback DC adds preflight latency to every
runpod-deploy validatecall (one extra DC stock-check round-trip) — sub-second cost; not a real penalty.
Neutral:
- ADR-020 stays Accepted (not Superseded) since the SET + TIER STRUCTURE + cap discipline are unchanged; SUBMISSION_AUDIT.md notes ADR-049 as “supersedes-in-part” on the priority-order axis only.
- No code changes required in
src/training/load_modernbert.pyorsrc/training/batch_table.pysince those depend on the SET + TIER STRUCTURE (unchanged), not on the order.
Alternatives Considered
- Broad failover refresh (add Blackwell SKUs: B200, B300 SXM6 AC, RTX PRO 6000 Blackwell {Server, Workstation}; add consumer fallback RTX 5090/4090). Rejected per user direction Q2: out of scope for the immediate canonical run. May fire as a separate ADR-050 if full-FT post-rehearsal needs more provisioning options.
- Keep ADR-020 priority + accept the “all configured GPUs are Low stock” warning + fire anyway. Rejected: misleading validator output erodes the audit trail; the warning would surface in every CI run + every operator preflight.
- Cut full-FT from submission entirely (3-rung ladder in Phase 5). Rejected per user direction Q1: the LoRA-vs-full-FT comparison is a load-bearing narrative element of ADR-019 + the writeup.
- Pre-provision a reserved A100-SXM4-80GB pod ahead of the canonical run (skip the orchestrator’s failover chain). Rejected: violates the runpod-deploy library-first discipline per ADR-020; the failover chain IS the resilience mechanism.
References
decisions/ADR-019-lora-and-transformer-training-recipe.md— 4-rung training recipe (classical-floor + frozen-probe + LoRA + full-FT)decisions/ADR-020-compute-infrastructure-and-cost-discipline.md— ADR being partially superseded (gpu_order priority axis only)decisions/ADR-033-github-release-strategy-rehearsal-plus-submission.md— v0.9.0-rc1 rehearsal-tag dispatch + fix-forward via rc2 if rehearsal exposes issuesdecisions/ADR-044-phase-2-training-implementation-bundle.md— Phase 2 implementation bundle (per-rung orchestration Q6)decisions/ADR-046-phase-4-analysis-implementation-bundle.md— Phase 4 implementation bundle (Q7 phase-tailoring lock that gates Phase 5 entry on rehearsal)decisions/library_imports.md— runpod-deploy primitives invoked (unchanged by this ADR)
Transcript
This ADR has no dedicated transcript file. The decision was made via a 3-question /AskUserQuestion session on 2026-05-17 in the same conversation that produced the Phase 4 Commits 2-6 implementation (transcript transcripts/2026-05-17__phase-4-commits-2-through-6-implementation.md captures the Q&A + the ADR-049 + X1 + X3 commit sequence).