Recipe: cost reconciliation across past runs
Pattern: read wall_time_sec, gpu_price_per_hour_usd,
gpu_price_source, and estimated_cost_usd from past
runpod_deploy_pull_manifest.json files to validate that the
budget.assumed_hourly_rate_usd you set in YAML is realistic, and to
catch cost drift across runs.
Why this is a recipe, not a schema feature
Cost-reconciliation analysis is consumer-domain: which manifests to include, how to aggregate, what counts as “drift,” and what to do about it (bump the assumed rate, split the GPU classes into separate configs, escalate to a quota review) all depend on your project’s budget discipline and tolerance for variance. Those decisions don’t belong in a deployment-primitives library.
What runpod-deploy owns is the capture of the cost signal: the v2
manifest preserves both the assumed rate (implicit in cost_cap_usd
budgeting) and the actual costPerHr parsed from runpodctl pod get
(gpu_price_per_hour_usd with gpu_price_source: pod_describe). The
reconciliation is yours to drive.
If your assumed rate is much lower than the captured costPerHr, jobs
hit the timeout before reaching their cost cap; if much higher, you’re
over-paying for headroom. Comparing the two gives you data to tune
budget.assumed_hourly_rate_usd in your YAML configs.
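The comparison can be reduced to a single relative-drift number. The sketch below is a minimal illustration, not part of runpod-deploy: `rate_drift` is a hypothetical helper name, and the two rates are passed in by hand (the sample values are the ones used later in this recipe).

```python
def rate_drift(assumed_hourly_rate_usd: float, observed_price_per_hour_usd: float) -> float:
    """Relative gap between the observed and the assumed hourly rate.

    Positive: the pod cost more per hour than assumed.
    Negative: the pod cost less per hour than assumed.
    """
    if assumed_hourly_rate_usd <= 0:
        raise ValueError("assumed rate must be positive")
    return (observed_price_per_hour_usd - assumed_hourly_rate_usd) / assumed_hourly_rate_usd


# Assumed $1.65/hr in YAML, observed $4.18/hr in the manifest.
drift = rate_drift(1.65, 4.18)
print(f"{drift:+.0%}")  # prints +153%
```

What drift value is acceptable is left to your project; the helper only reports the number.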
One-shot inspection of the latest run

    LATEST=$(ls -dt artifacts/runpod/*/ | head -1)
    runpod-deploy manifest-summary "$LATEST/runpod_deploy_pull_manifest.json"

Look at the wall_time_sec, price_usd/hr, est_cost_usd, and
cost_cap_usd lines. If est_cost_usd is much lower than
cost_cap_usd, the cap is loose; if it bumps against it, you may have
hit the timeout.
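The same check can be done on the raw manifest. This is a sketch under one assumption: that the manifest JSON carries a `cost_cap_usd` key next to `estimated_cost_usd` (the summary output shows both, but the key name in the file is not confirmed here). `cap_utilization` is a hypothetical helper.

```python
def cap_utilization(manifest: dict):
    """Fraction of the cost cap consumed by the run, or None if it cannot be computed.

    Assumes the manifest exposes `estimated_cost_usd` and `cost_cap_usd`
    (the latter key name is an assumption).
    """
    est = manifest.get("estimated_cost_usd")
    cap = manifest.get("cost_cap_usd")
    if est is None or not cap:
        return None  # missing field or zero cap: nothing meaningful to report
    return est / cap


print(cap_utilization({"estimated_cost_usd": 5.0, "cost_cap_usd": 20.0}))  # prints 0.25
```

A value far below 1.0 suggests a loose cap; a value near 1.0 is worth a closer look at the run.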
Sweep across many runs (Python)

    import json
    from pathlib import Path

    # One manifest per run directory under artifacts/runpod/.
    manifests = list(Path("artifacts/runpod").glob("*/runpod_deploy_pull_manifest.json"))

    for path in sorted(manifests):
        m = json.loads(path.read_text())
        if m.get("schema_version") != "v2":
            continue  # only v2 manifests are considered here
        if m.get("gpu_price_source") != "pod_describe":
            continue  # skip runs that fell back to assumed_rate
        print(
            f"{m['run_id']:30s} "
            f"gpu={m['gpu_id']:30s} "
            f"wall={m['wall_time_sec']:8.0f}s "
            f"price=${m['gpu_price_per_hour_usd']:.2f}/hr "
            f"est_cost=${m['estimated_cost_usd']:.2f} "
            f"final_state={m['pod_final_state']}"
        )

This gives you a per-GPU price table over time. If your assumed_hourly_rate_usd
is set at $1.65 but H100 runs consistently report $4.18, bump the
assumed rate (which lengthens the implicit timeout) or split the GPUs
into separate configs each with their own assumed rate.
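To turn the per-run table into one number per GPU class, the captured prices can be grouped by `gpu_id`. The following is a minimal sketch, not a runpod-deploy feature: `median_price_by_gpu` is a hypothetical helper, and the median is just one reasonable choice of aggregate.

```python
import json
from collections import defaultdict
from pathlib import Path
from statistics import median


def median_price_by_gpu(root):
    """Map gpu_id -> median captured price per hour across v2 manifests under root.

    Only runs whose price came from pod_describe are counted, so the result
    reflects observed prices rather than the assumed-rate fallback.
    """
    prices = defaultdict(list)
    for path in Path(root).glob("*/runpod_deploy_pull_manifest.json"):
        m = json.loads(path.read_text())
        if m.get("schema_version") != "v2":
            continue
        if m.get("gpu_price_source") != "pod_describe":
            continue
        prices[m["gpu_id"]].append(m["gpu_price_per_hour_usd"])
    return {gpu: median(vals) for gpu, vals in prices.items()}
```

Calling `median_price_by_gpu("artifacts/runpod")` returns a plain dict that you can compare against the assumed rate of each config.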
Detecting failed/killed pods
pod_final_state from runpodctl pod get’s desiredStatus field
distinguishes:
- EXITED: clean shutdown after the run (your code finished or hit the success marker)
- RUNNING: pod still active when the manifest was written (means lifecycle.on_*: preserve and the pod was preserved)
- Anything else (TERMINATED, FAILED, STOPPED): surfaced as a pod_killed_unexpected event in events.jsonl. Cross-reference RunPod console history to find the cause.
When a run shows failed: true and pod_final_state: TERMINATED,
RunPod killed the pod mid-run — usually quota or capacity. Re-running
on a different DC (multi-DC failover) avoids the same outcome next time.
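A sweep for such runs can reuse the manifest fields named above. This is a sketch only: `unexpected_terminations` is a hypothetical helper, and it treats a missing `pod_final_state` as unexpected, which may be stricter than you want.

```python
import json
from pathlib import Path

# States this recipe treats as normal outcomes.
EXPECTED_STATES = {"EXITED", "RUNNING"}


def unexpected_terminations(root):
    """Return (run_id, pod_final_state) for runs that failed or ended in an unexpected state."""
    hits = []
    for path in sorted(Path(root).glob("*/runpod_deploy_pull_manifest.json")):
        m = json.loads(path.read_text())
        state = m.get("pod_final_state")
        # Flag the run if it is marked failed, or if the state is absent
        # or anything other than EXITED / RUNNING.
        if m.get("failed") or state not in EXPECTED_STATES:
            # Fall back to the run directory name when run_id is missing.
            hits.append((m.get("run_id", path.parent.name), state))
    return hits
```

The returned list is a starting point for cross-referencing events.jsonl and the RunPod console.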
What lives where
| Concern | Owner |
|---|---|
| Capturing | |
| Capturing wall time and computing | |
| Distinguishing | |
| Aggregating across multiple manifests | Your driver (or |
| Deciding whether observed drift warrants tuning | Your project’s cost discipline |
| Filing a cost-anomaly issue when drift exceeds your tolerance | Your release process |
Anti-pattern to avoid
Don’t hard-code cost-tolerance thresholds into the YAML config or wrap
them in deploy-time validate rules. Cost drift is a post-hoc
analytical concern — what counts as “too much” depends on the project’s
budget discipline, which evolves without code releases.
Don’t re-derive estimated_cost_usd in your driver from
wall_time_sec * gpu_price_per_hour_usd; the manifest already
computes it (with proper handling of the pod_describe vs
assumed_rate price-source fallback) at manifest._estimated_cost_usd.
Re-deriving risks divergence when the formula changes.
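In practice that means reading the stored value when aggregating. A minimal sketch, assuming only the `estimated_cost_usd` field described above (`total_estimated_cost` is a hypothetical helper):

```python
import json
from pathlib import Path


def total_estimated_cost(root):
    """Sum the manifest-provided estimated_cost_usd over all runs under root.

    The stored value is used as-is; nothing is recomputed from
    wall_time_sec or gpu_price_per_hour_usd.
    """
    total = 0.0
    for path in Path(root).glob("*/runpod_deploy_pull_manifest.json"):
        m = json.loads(path.read_text())
        cost = m.get("estimated_cost_usd")
        if cost is not None:  # skip manifests that carry no estimate
            total += float(cost)
    return total
```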
See also
- multi-config-sweep.md: the typical source of many manifests; pair it with manifest-summary --root after a sweep completes.
- embed-deploy-metadata.md: the same fields the manifest captures (local_git_sha, payload_lockfile) are exposed via the capture-env subcommand if you need them in your own evals manifest.
- local-postprocess-after-run.md: walks <run_dir> for aggregation; cost reconciliation is one flavor of post-processing.