# Recipe: cost reconciliation across past runs

**Pattern:** read `wall_time_sec`, `gpu_price_per_hour_usd`, `gpu_price_source`, and `estimated_cost_usd` from past `runpod_deploy_pull_manifest.json` files to validate that the `budget.assumed_hourly_rate_usd` you set in YAML is realistic, and to catch cost drift across runs.

## Why this is a recipe, not a schema feature

Cost-reconciliation analysis is *consumer-domain*: which manifests to include, how to aggregate, what counts as "drift," and what to do about it (bump the assumed rate, split the GPU classes into separate configs, escalate to a quota review) all depend on your project's budget discipline and tolerance for variance. Those decisions don't belong in a deployment-primitives library.

What `runpod-deploy` owns is the *capture* of the cost signal: the v2 manifest preserves both the assumed rate (implicit in `cost_cap_usd` budgeting) and the actual `costPerHr` parsed from `runpodctl pod get` (`gpu_price_per_hour_usd` with `gpu_price_source: pod_describe`). The reconciliation is yours to drive.

If your assumed rate is much lower than the captured `costPerHr`, jobs hit the timeout before reaching their cost cap; if much higher, you're over-paying for headroom. Comparing the two gives you data to tune `budget.assumed_hourly_rate_usd` in your YAML configs.

## One-shot inspection of the latest run

```sh
LATEST=$(ls -dt artifacts/runpod/*/ | head -1)
runpod-deploy manifest-summary "$LATEST/runpod_deploy_pull_manifest.json"
```

Look at the `wall_time_sec`, `price_usd/hr`, `est_cost_usd`, and `cost_cap_usd` lines. If `est_cost_usd` is much lower than `cost_cap_usd`, the cap is loose; if it bumps against it, you may have hit the timeout.

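The same comparison can be scripted for a single manifest. The following is a minimal sketch, not part of `runpod-deploy`: the helper name `reconcile` is hypothetical, the assumed rate is passed in by hand because it lives in your YAML rather than in the manifest, and reading `cost_cap_usd` from the manifest JSON is an assumption (the helper treats it as optional). The example values are illustrative, not real run data.

```python
from typing import Optional


def reconcile(manifest: dict, assumed_hourly_rate_usd: float) -> dict:
    """Compare one manifest's observed price and cost against the budget.

    Hypothetical helper. `cost_cap_usd` as a manifest key is an assumption,
    so a missing value is tolerated.
    """
    observed = manifest["gpu_price_per_hour_usd"]
    est_cost = manifest["estimated_cost_usd"]  # use the manifest's value, do not re-derive it
    cap = manifest.get("cost_cap_usd")  # assumed field name

    # Above 1.0: the pod cost more per hour than the YAML assumed.
    rate_ratio = observed / assumed_hourly_rate_usd
    # Fraction of the cap actually spent; None when no cap was recorded.
    cap_utilization: Optional[float] = est_cost / cap if cap else None
    return {"rate_ratio": rate_ratio, "cap_utilization": cap_utilization}


# Illustrative values only.
example = {
    "gpu_price_per_hour_usd": 3.0,
    "estimated_cost_usd": 6.0,
    "cost_cap_usd": 8.0,
}
result = reconcile(example, assumed_hourly_rate_usd=1.5)
print(result)
```

A `rate_ratio` well above 1.0 suggests the assumed rate is too low; a `cap_utilization` far below 1.0 suggests the cap is loose. What counts as "well above" or "far below" is your project's call.
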
## Sweep across many runs (Python)

```python
import json
from pathlib import Path

# One manifest per run directory under artifacts/runpod/.
manifests = list(Path("artifacts/runpod").glob("*/runpod_deploy_pull_manifest.json"))

for path in sorted(manifests):
    m = json.loads(path.read_text())
    if m.get("schema_version") != "v2":
        continue  # older manifests lack the price fields
    if m.get("gpu_price_source") != "pod_describe":
        continue  # skip runs that fell back to assumed_rate
    print(
        f"{m['run_id']:30s} "
        f"gpu={m['gpu_id']:30s} "
        f"wall={m['wall_time_sec']:8.0f}s "
        f"price=${m['gpu_price_per_hour_usd']:.2f}/hr "
        f"est_cost=${m['estimated_cost_usd']:.2f} "
        f"final_state={m['pod_final_state']}"
    )
```

This gives you a per-GPU price table over time. If your `assumed_hourly_rate_usd` is set at $1.65 but H100 runs consistently report $4.18, bump the assumed rate (which lengthens the implicit timeout) or split the GPUs into separate configs, each with its own assumed rate.

## Detecting failed/killed pods

`pod_final_state`, taken from the `desiredStatus` field of `runpodctl pod get`, distinguishes:

- `EXITED`: clean shutdown after the run (your code finished or hit the success marker).
- `RUNNING`: the pod was still active when the manifest was written (`lifecycle.on_*: preserve` was set and the pod was preserved).
- Anything else (`TERMINATED`, `FAILED`, `STOPPED`): surfaced as a `pod_killed_unexpected` event in `events.jsonl`. Cross-reference the RunPod console history to find the cause.

When a run shows `failed: true` and `pod_final_state: TERMINATED`, RunPod killed the pod mid-run, usually because of quota or capacity. Re-running in a different DC (multi-DC failover) can avoid the same outcome next time.

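The three cases above can be tallied across already-loaded manifests. This is a minimal sketch under stated assumptions: `classify_final_state` and `summarize_states` are hypothetical helpers, the bucket labels are illustrative, and the `unknown` bucket for a missing `pod_final_state` is a choice made here rather than documented behavior.

```python
from collections import Counter
from typing import Iterable, Optional


def classify_final_state(state: Optional[str]) -> str:
    """Map a manifest's pod_final_state to a coarse bucket (labels are illustrative)."""
    if state is None:
        return "unknown"  # field missing; this bucket is an assumption
    if state == "EXITED":
        return "clean"
    if state == "RUNNING":
        return "preserved"
    return "killed_unexpected"  # TERMINATED, FAILED, STOPPED, ...


def summarize_states(manifests: Iterable[dict]) -> Counter:
    """Count how many manifests fall into each bucket."""
    return Counter(classify_final_state(m.get("pod_final_state")) for m in manifests)


# Illustrative manifests, not real run data.
sample = [
    {"run_id": "a", "pod_final_state": "EXITED"},
    {"run_id": "b", "pod_final_state": "TERMINATED"},
    {"run_id": "c", "pod_final_state": "RUNNING"},
    {"run_id": "d"},
]
counts = summarize_states(sample)
print(dict(counts))
```

A rising `killed_unexpected` count on one GPU class or data center is a cue to check `events.jsonl` and the RunPod console before tuning any cost settings.
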
## What lives where

| Concern | Owner |
|---|---|
| Capturing `costPerHr` from `runpodctl pod get` and persisting it | `runpod-deploy` (`telemetry._extract_price` → manifest `gpu_price_per_hour_usd`) |
| Capturing wall time and computing `estimated_cost_usd` | `runpod-deploy` (`manifest._estimated_cost_usd`) |
| Distinguishing `pod_describe` vs `assumed_rate` price sources | `runpod-deploy` (`gpu_price_source` manifest field) |
| Aggregating across multiple manifests | Your driver (or `runpod_deploy.forensics.walk_run_dirs` + `load_manifest`) |
| Deciding whether observed drift warrants tuning `assumed_hourly_rate_usd` | Your project's cost discipline |
| Filing a cost-anomaly issue when drift exceeds your tolerance | Your release process |

## Anti-pattern to avoid

Don't hard-code cost-tolerance thresholds into the YAML config or wrap them in deploy-time `validate` rules. Cost drift is a *post-hoc* analytical concern: what counts as "too much" depends on the project's budget discipline, which evolves without code releases.

Don't re-derive `estimated_cost_usd` in your driver from `wall_time_sec * gpu_price_per_hour_usd`; the manifest already computes it (with proper handling of the `pod_describe` vs `assumed_rate` price-source fallback) at `manifest._estimated_cost_usd`. Re-deriving risks divergence when the formula changes.

## See also

- [`multi-config-sweep.md`](multi-config-sweep.md): the typical source of many manifests; pair `manifest-summary --root` after a sweep completes.
- [`embed-deploy-metadata.md`](embed-deploy-metadata.md): the same fields the manifest captures (`local_git_sha`, `payload_lockfile`) are exposed via the `capture-env` subcommand if you need them in your own evals manifest.
- [`local-postprocess-after-run.md`](local-postprocess-after-run.md): walks `` for aggregation; cost reconciliation is one flavor of post-processing.