Recipe: stock-out diagnostic
Pattern: probe runpod-deploy gpu-list for the configured datacenters
before invoking runpod-deploy run, then choose one of four documented
actions when the configured gpu_order × datacenters matrix is empty
in your target cloud_type.
Why this is a recipe, not a schema feature
Detecting a stock-out is a deploy-domain concern, and runpod-deploy
already owns it: runpod-deploy gpu-list --datacenter <id> returns
per-GPU availability and price for one datacenter, and
runpod-deploy validate --check-availability walks the configured
gpu_order across datacenters and raises if nothing is stocked.
What is not a deploy-domain concern is the decision you make once
you’ve detected the stock-out. Should you wait? Widen the gpu_order?
Switch cloud_type? Skip this run and ship a follow-up? Those are
consumer-domain calls — they depend on your release cadence, budget,
and how time-sensitive the run is. So the decision tree lives here, as
a recipe, not as a CLI subcommand that pretends to know your priorities.
Pattern (bash)
Run this as a pre-launch step in your sweep driver. It exits non-zero when the configured matrix is empty, so the caller can short-circuit instead of paying the multi-minute retry-with-backoff penalty against a genuinely empty cluster.
```bash
#!/usr/bin/env bash
set -euo pipefail

CONFIG=${1:-configs/runpod/headline.yaml}

# Cheap path: let runpod-deploy's existing preflight raise on stock-out.
# This walks gpu_order × datacenters and raises if no GPU is stocked.
if runpod-deploy validate --config "$CONFIG" --check-availability; then
  exec runpod-deploy run --config "$CONFIG"
fi

# Diagnostic path: validate said no stock. Print one gpu-list per
# datacenter so the operator can see exactly what IS stocked right now,
# then surface the four-action menu.
echo
echo "==> STOCK-OUT: no GPU in your gpu_order is currently stocked"
echo "    across the configured datacenters in your cloud_type."
echo
echo "    Currently-stocked GPUs per datacenter:"
for dc in $(python -c "import sys, yaml; print(' '.join(yaml.safe_load(open(sys.argv[1]))['pod']['datacenters']))" "$CONFIG"); do
  echo "    --- $dc ---"
  runpod-deploy gpu-list --datacenter "$dc" --no-prices || true
done

cat <<'ACTIONS'

Recommended actions:
  (a) Wait 15-30 min for inventory to recover, then re-run this script
  (b) Widen `pod.gpu_order` in the config to include GPUs from the
      list above
  (c) Switch to COMMUNITY cloud: add `pod.cloud_type: COMMUNITY` to
      the config (note: SECURE-only constraints like network volumes
      will not work on COMMUNITY)
  (d) Defer this run: skip the version, ship the next one without
      this slice, file a tracking issue noting the stock-out window
ACTIONS
exit 3
```
The exit 3 is deliberately distinct from exit 1 (validation failure)
and exit 2 (CLI usage error), so callers can branch on it. The four
actions map to consumer-domain decisions; (b) and (c) are reversible
config edits, (a) is patience, (d) is a calendar trade-off.
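A caller can branch on these statuses with a small helper. The sketch below is illustrative only: the function name `classify_preflight_status` and the labels it prints are hypothetical, and the meanings of 1 and 2 are taken from the paragraph above, not checked against runpod-deploy itself.

```bash
#!/usr/bin/env bash
# Hypothetical caller-side helper: translate the recipe's exit status
# into a label the sweep driver can switch on. Only status 3 is
# defined by the recipe itself; 1 and 2 follow the prose above.
classify_preflight_status() {
  case "$1" in
    0) echo "ok" ;;                  # validate passed and the run exited cleanly
    1) echo "validation-failure" ;;  # as described above (assumption)
    2) echo "usage-error" ;;         # as described above (assumption)
    3) echo "stock-out" ;;           # the recipe's dedicated status
    *) echo "unknown" ;;
  esac
}
```

In a driver this might be used as `status=0; ./stockout-preflight.sh "$CONFIG" || status=$?; classify_preflight_status "$status"`, where `stockout-preflight.sh` is a placeholder name for the script above. Because the script `exec`s runpod-deploy run on the happy path, any status other than 3 may also originate from the run itself.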
What lives where
| Concern | Owner |
|---|---|
| Per-datacenter GPU availability + price | runpod-deploy gpu-list |
| Walking gpu_order × datacenters | runpod-deploy validate --check-availability |
| Pod provisioning + multi-DC failover | runpod-deploy run |
| Choosing between wait / widen / switch / defer | Your sweep driver (this recipe) |
| Filing a tracking issue when a stock-out blocks a release | Your release process |
Anti-pattern to avoid
Do not bake the four-action decision tree into runpod-deploy itself.
A future runpod-deploy autoretry-on-stockout flag would have to
guess — wait how long? widen to which GPUs? switch only if pricing is
within X%? — and any choice is wrong for some consumer. Keep the policy
in your driver where you can see and tune it.
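As one possible shape for driver-side policy, the sketch below combines action (a) with action (d): it re-runs a preflight command a bounded number of times on exit status 3, then gives up. Everything here is an assumption for illustration: the function name `run_with_stockout_policy`, the `MAX_WAITS` and `WAIT_SECONDS` knobs, and their defaults are hypothetical and should be tuned to your release cadence.

```bash
#!/usr/bin/env bash
# Hypothetical driver-side policy. Knobs are plain variables so the
# policy stays visible and tunable in the driver.
MAX_WAITS=${MAX_WAITS:-2}          # how many times to wait before deferring
WAIT_SECONDS=${WAIT_SECONDS:-900}  # 15 min, the low end of action (a)

# Usage: run_with_stockout_policy <preflight-command> <config>
# Returns the preflight's status; returns 3 if still stocked out
# after MAX_WAITS waits, so the caller can defer the run.
run_with_stockout_policy() {
  local preflight=$1 config=$2 attempt=0 status
  while :; do
    status=0
    "$preflight" "$config" || status=$?
    # Anything other than the stock-out status is passed straight through.
    if [ "$status" -ne 3 ]; then
      return "$status"
    fi
    if [ "$attempt" -ge "$MAX_WAITS" ]; then
      echo "still stocked out after $attempt wait(s); deferring $config" >&2
      return 3
    fi
    attempt=$((attempt + 1))
    sleep "$WAIT_SECONDS"
  done
}
```

The point of the sketch is that the wait count and interval live in the driver, where a consumer with a different budget or deadline can change them without touching runpod-deploy.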
Similarly, do not parse gpu-list’s human-readable output in a
production driver. Today the output is for operators reading a terminal;
if your driver needs a machine-readable stock signal, branch on the exit
status of runpod-deploy validate --check-availability (which exits
non-zero on empty) instead of scraping gpu-list.
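A minimal sketch of consuming that exit status without reading any output follows. The function name `partition_by_stock`, the `ready`/`blocked` labels, and the `RUNPOD_DEPLOY` override variable are hypothetical; only the `validate --config ... --check-availability` invocation comes from the script above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: label each config as ready or blocked using
# only the exit status of the availability check. Note that any
# non-zero status is treated as "blocked", which may also include
# validation failures unrelated to stock.
partition_by_stock() {
  local config
  for config in "$@"; do
    # RUNPOD_DEPLOY is an illustrative override hook, not a documented variable.
    if "${RUNPOD_DEPLOY:-runpod-deploy}" validate --config "$config" \
         --check-availability >/dev/null 2>&1; then
      echo "ready $config"
    else
      echo "blocked $config"
    fi
  done
}
```

A sweep driver could call `partition_by_stock configs/runpod/*.yaml` before its per-config loop and launch only the `ready` entries.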
See also
- local-preflight-then-run.md: the canonical pre-launch pattern; this recipe is a stock-out-specific specialization of the same Makefile shape.
- multi-config-sweep.md: most sweep drivers call a pre-flight check before the per-config loop; this recipe is what that check should look like for stock-out.