Recipe: pull artifacts, then post-process locally
Pattern: runpod-deploy run writes pulled artifacts under
{project_root}/artifacts/runpod/<timestamp>/. Plotting, aggregation,
and report rendering happen after the run on the local machine, where
they’re cheap and parallelizable. The pulled runpod_deploy_pull_manifest.json
tells you exactly which run dir to point at.
Why this is a recipe, not a schema feature
runpod-deploy does not run plot scripts or k-fold aggregations.
Those are consumer-domain — they belong in your project’s Makefile or
Python entry points, where they’re typed and testable.
Pattern (Makefile)
.PHONY: headline-cloud headline-plots

headline-cloud:
	runpod-deploy run --config configs/runpod/headline.yaml

# Find the most recent run dir and render plots from it.
# Deferred (=) rather than immediate (:=) assignment, so the lookup runs when
# the headline-plots recipe is expanded, after headline-cloud has finished.
LATEST_RUN_DIR = $(shell ls -dt artifacts/runpod/*/ | head -1)

headline-plots:
	uv run python scripts/render_plots.py --run-dir $(LATEST_RUN_DIR)
	runpod-deploy manifest-summary $(LATEST_RUN_DIR)/runpod_deploy_pull_manifest.json
make headline-cloud headline-plots chains them; or run them
separately if you want to inspect artifacts before plotting.
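If your post-processing entry point is a Python script rather than a Makefile target, the same "most recent run dir" lookup can be done in Python. The following is a minimal sketch, assuming run directories carry sortable UTC timestamp names of the form 20260514T120000Z. The function name latest_run_dir is hypothetical and not part of runpod-deploy.

```python
from pathlib import Path

MANIFEST_NAME = "runpod_deploy_pull_manifest.json"


def latest_run_dir(root: Path) -> Path:
    """Return the newest run dir under root that contains a pull manifest.

    Assumption: directory names are sortable timestamps
    (e.g. 20260514T120000Z), so the lexicographically largest name
    is the most recent run.
    """
    candidates = [
        d for d in root.iterdir()
        if d.is_dir() and (d / MANIFEST_NAME).is_file()
    ]
    if not candidates:
        raise FileNotFoundError(f"no run dir with {MANIFEST_NAME} under {root}")
    return max(candidates, key=lambda d: d.name)
```

Calling latest_run_dir(Path("artifacts/runpod")) returns the directory to hand to your plotting code. Unlike ls -dt, this sorts by directory name rather than modification time, so touching files in an older run dir does not change the result, and directories without a pulled manifest are skipped.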
Inspecting what came back
runpod-deploy manifest-summary artifacts/runpod/20260514T120000Z/runpod_deploy_pull_manifest.json
Prints job name, run id, pod id, GPU, datacenter, wall time, captured $/hr price, estimated cost, deploy metadata (git SHA + lockfile hash), per-artifact pull status, and the list of telemetry files. Useful as a “did this run succeed” gate at the top of post-processing scripts.
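A post-processing script can also apply the gate itself by reading the manifest before doing any work. The sketch below is illustrative only: the manifest schema is not described on this page, so the key names "artifacts" and "status" and the value "ok" are assumptions to check against a real runpod_deploy_pull_manifest.json. The function names are hypothetical as well.

```python
import json
from pathlib import Path


def failed_artifacts(
    manifest_path: Path,
    artifacts_key: str = "artifacts",  # assumed key name, verify against a real manifest
    status_key: str = "status",        # assumed key name
    ok_value: str = "ok",              # assumed "pulled successfully" value
) -> list:
    """Return manifest artifact entries whose pull status is not ok_value."""
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    entries = manifest.get(artifacts_key, [])
    return [e for e in entries if e.get(status_key) != ok_value]


def require_clean_pull(manifest_path: Path) -> None:
    """Stop the script with a non-zero exit if any artifact was not pulled."""
    failed = failed_artifacts(manifest_path)
    if failed:
        raise SystemExit(f"{len(failed)} artifact(s) not pulled cleanly: {failed}")
```

Calling require_clean_pull(run_dir / "runpod_deploy_pull_manifest.json") at the top of a plotting script makes it fail fast instead of rendering plots from a partial pull.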
Forensic deep-dive
Each run dir also contains:
- run.log: full remote stdout/stderr (always pulled when the run started, even on failure)
- events.jsonl: orchestrator events (GPU selection, datacenter failover, artifact pull start/complete/fail, pod kill detection, optional __RUNPOD_STEP_*__ markers)
- metrics.jsonl: periodic GPU/CPU/mem/disk samples (~one row per telemetry.sample_interval_sec)
- nvidia_smi_{start,end}.txt, pod_describe_{start,end}.json, dmesg_tail.txt, pip_freeze.txt, remote_env.json
Walk these in your own analysis script when you need to reconstruct why a run behaved a certain way.
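As a starting point for such a script, the two .jsonl files can be read with a generic JSON Lines reader. This is a minimal sketch that makes no assumption about the row schema. The helper names read_jsonl and count_by are hypothetical, and the key you tally by (for example an event-type field) must come from inspecting your own files.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Iterator


def read_jsonl(path: Path) -> Iterator[dict]:
    """Yield one parsed object per non-empty line of a JSON Lines file."""
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_by(path: Path, key: str) -> Counter:
    """Tally rows of a JSONL file by the value stored under key.

    Rows that lack the key are counted under "<missing>".
    """
    return Counter(str(row.get(key, "<missing>")) for row in read_jsonl(path))
```

For example, count_by(run_dir / "events.jsonl", "event") gives a quick histogram of orchestrator events, assuming the rows carry a field named "event". Substitute whatever field name your events.jsonl actually uses.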
What lives where
| Concern | Owner |
|---|---|
| Pulling remote artifacts to the local run dir | |
| Capturing | |
| Walking the run-dir directory tree | Your driver (or |
| Parsing | Your driver (or |
| Parsing | Your driver (or |
| Plotting / aggregation / metrics computation | Your post-processing code (consumer-domain) |
| Joining pulled artifacts with project-wide eval results | Your post-processing code |
Anti-pattern to avoid
Don’t push post-processing onto the pod (e.g. running plotting code
inside run.body: or appending pandas aggregation to setup:). The
pod is a deployment substrate, not a workflow runner — keep it
single-responsibility (training, eval, generation). Local
post-processing is fast (no SSH RTT), cheap (no GPU), and trivially
re-runnable against pulled artifacts.
If your post-processing legitimately needs GPU compute, that’s a
second job — a different runpod-deploy run config with its own
artifacts, not a hidden second step inside the primary run.
See also
- local-preflight-then-run.md: the sibling pattern for pre-run audits; the same Makefile target often wires both.
- predictions-only-eval.md: the canonical case for pulling per-row outputs and doing all metrics CPU-side.
- cost-reconciliation.md: reads runpod_deploy_pull_manifest.json for cost/wall-time per run; pair with the artifact analysis above.
- For aggregate forensics across many runs: runpod-deploy events-query and manifest-summary --root (see troubleshooting.md “Forensic recovery”).