Recipe: pull artifacts, then post-process locally

Pattern: `runpod-deploy run` writes pulled artifacts under `{project_root}/artifacts/runpod/<timestamp>/`. Plotting, aggregation, and report rendering happen after the run, on the local machine, where they are cheap and parallelizable. The pulled `runpod_deploy_pull_manifest.json` sits in that run dir and tells you exactly which run dir to point post-processing at.

Why this is a recipe, not a schema feature

`runpod-deploy` does not run plot scripts or k-fold aggregations. Those are consumer-domain concerns: they belong in your project’s Makefile or Python entry points, where they can be typed and tested.

Pattern (Makefile)

.PHONY: headline-cloud headline-plots
headline-cloud:
	runpod-deploy run --config configs/runpod/headline.yaml

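# Find the most recent run dir and render plots from it.
# Use `=` (lazy), not `:=`: with `:=` the lookup runs when the Makefile is parsed,
# i.e. before headline-cloud has created the new run dir.
LATEST_RUN_DIR = $(shell ls -dt artifacts/runpod/*/ | head -1)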

headline-plots:
	uv run python scripts/render_plots.py --run-dir $(LATEST_RUN_DIR)
	runpod-deploy manifest-summary $(LATEST_RUN_DIR)/runpod_deploy_pull_manifest.json

`make headline-cloud headline-plots` chains them; run the targets separately if you want to inspect artifacts before plotting.
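
If the plotting script should locate the run dir itself rather than receive it from make, the lookup can be sketched in Python. This is a minimal sketch, assuming run dirs are named with UTC timestamps such as `20260514T120000Z` (the form used in the example paths in this recipe), so that sorting by name matches chronological order. `latest_run_dir` is a hypothetical helper, not part of runpod-deploy.

```python
from pathlib import Path


def latest_run_dir(root: Path = Path("artifacts/runpod")) -> Path:
    """Return the newest run dir under root, judged by directory name.

    ASSUMPTION: dirs are timestamp-named (e.g. 20260514T120000Z), so they
    sort chronologically as plain strings. Unlike `ls -dt`, this ignores
    mtime, so touching files in an old run dir does not change the result.
    """
    run_dirs = sorted((p for p in root.iterdir() if p.is_dir()), key=lambda p: p.name)
    if not run_dirs:
        raise FileNotFoundError(f"no run dirs under {root}")
    return run_dirs[-1]
```

Sorting by name instead of modification time is a design choice: it keeps the result stable if an older run dir is edited after the fact.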

Inspecting what came back

runpod-deploy manifest-summary artifacts/runpod/20260514T120000Z/runpod_deploy_pull_manifest.json

Prints job name, run id, pod id, GPU, datacenter, wall time, captured $/hr price, estimated cost, deploy metadata (git SHA + lockfile hash), per-artifact pull status, and the list of telemetry files. Useful as a “did this run succeed” gate at the top of post-processing scripts.
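
A gate of that kind might look like the following sketch. The manifest schema is not documented in this recipe, so the key names used here (`artifacts`, `status`, and the success value `"ok"`) are assumptions for illustration only; check them against a real `runpod_deploy_pull_manifest.json` before relying on this.

```python
import json
from pathlib import Path

MANIFEST_NAME = "runpod_deploy_pull_manifest.json"


def failed_artifacts(run_dir: Path) -> list[dict]:
    """Return manifest entries whose pull did not succeed.

    ASSUMPTION: the manifest has a top-level "artifacts" list whose entries
    carry a "status" field equal to "ok" on success. These names are
    hypothetical; adjust them to the real schema.
    """
    manifest = json.loads((run_dir / MANIFEST_NAME).read_text())
    return [a for a in manifest.get("artifacts", []) if a.get("status") != "ok"]


def require_clean_pull(run_dir: Path) -> None:
    """Raise before any plotting starts if some artifact was not pulled cleanly."""
    failed = failed_artifacts(run_dir)
    if failed:
        raise RuntimeError(f"{len(failed)} artifact(s) not pulled cleanly in {run_dir}")
```

Calling `require_clean_pull(run_dir)` as the first statement of a plotting script makes a partial pull fail loudly instead of producing plots from incomplete data. A missing manifest surfaces as `FileNotFoundError` from `read_text`.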

Forensic deep-dive

Each run dir also contains:

`run.log`: full remote stdout/stderr (pulled whenever the run started, even if it then failed)
`events.jsonl`: orchestrator events (GPU selection, datacenter failover, artifact pull start/complete/fail, pod kill detection, optional `__RUNPOD_STEP_*__` markers)
`metrics.jsonl`: periodic GPU/CPU/mem/disk samples (roughly one row every `telemetry.sample_interval_sec` seconds)
`nvidia_smi_{start,end}.txt`, `pod_describe_{start,end}.json`, `dmesg_tail.txt`, `pip_freeze.txt`, `remote_env.json`

Walk these in your own analysis script when you need to reconstruct why a run behaved a certain way.
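
An analysis script of that kind mostly needs a tolerant JSONL reader. The sketch below is plain standard-library Python and does not use `runpod_deploy.forensics`. It skips lines that fail to parse, on the assumption that a killed pod can leave a truncated final line. The `event` key used for tallying is a hypothetical field name, since the event schema is not specified here.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Iterator


def read_jsonl(path: Path) -> Iterator[dict]:
    """Yield one dict per well-formed line; skip blank or unparseable lines."""
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # e.g. a line cut short when the pod was killed
            if isinstance(record, dict):
                yield record


def count_events(run_dir: Path, key: str = "event") -> Counter:
    """Tally events.jsonl rows by `key` (ASSUMPTION: hypothetical field name)."""
    rows = read_jsonl(run_dir / "events.jsonl")
    return Counter(str(row.get(key, "<missing>")) for row in rows)
```

The same `read_jsonl` helper works for `metrics.jsonl`; only the keys you read from each row differ.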

What lives where

| Concern | Owner |
| --- | --- |
| Pulling remote artifacts to the local run dir | `runpod-deploy run` (`artifacts[*]` config + `manifest.write_pull_manifest`) |
| Capturing `events.jsonl`, `metrics.jsonl`, `nvidia_smi_{start,end}.txt`, `pod_describe_{start,end}.json` | `runpod-deploy run` (`telemetry` module) |
| Walking the run-dir directory tree | Your driver (or `runpod_deploy.forensics.walk_run_dirs`) |
| Parsing `runpod_deploy_pull_manifest.json` | Your driver (or `runpod_deploy.forensics.load_manifest`) |
| Parsing `events.jsonl` | Your driver (or `runpod_deploy.forensics.load_events`) |
| Plotting / aggregation / metrics computation | Your post-processing code (consumer-domain) |
| Joining pulled artifacts with project-wide eval results | Your post-processing code |
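
To illustrate the "your driver" rows, a run-dir walk needs nothing beyond the standard library. The sketch below yields each run dir under the artifacts root and reports which of the files named in this recipe are absent. `iter_run_dirs` and `missing_files` are hypothetical helper names; the signatures of the `runpod_deploy.forensics` functions are not shown here, so the sketch does not call them.

```python
from pathlib import Path
from typing import Iterator

# File names as listed under "Forensic deep-dive", plus the pull manifest.
EXPECTED_FILES = (
    "runpod_deploy_pull_manifest.json",
    "run.log",
    "events.jsonl",
    "metrics.jsonl",
)


def iter_run_dirs(root: Path) -> Iterator[Path]:
    """Yield run dirs under root in name order (oldest first for timestamp names)."""
    yield from sorted((p for p in root.iterdir() if p.is_dir()), key=lambda p: p.name)


def missing_files(run_dir: Path) -> list[str]:
    """Return the expected file names that are absent from run_dir."""
    return [name for name in EXPECTED_FILES if not (run_dir / name).is_file()]
```

A driver could loop over `iter_run_dirs(Path("artifacts/runpod"))` and skip or flag any run dir for which `missing_files(run_dir)` is non-empty before handing it to plotting code.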

Anti-pattern to avoid

Don’t push post-processing onto the pod (e.g. running plotting code inside `run.body:` or appending pandas aggregation to `setup:`). The pod is a deployment substrate, not a workflow runner, so keep it single-responsibility (training, eval, generation). Local post-processing is fast (no SSH round trips), cheap (no GPU), and trivially re-runnable against pulled artifacts.

If your post-processing legitimately needs GPU compute, that’s a second job — a different runpod-deploy run config with its own artifacts, not a hidden second step inside the primary run.