Recipe: pull artifacts, then post-process locally

Pattern: runpod-deploy run writes pulled artifacts under {project_root}/artifacts/runpod/<timestamp>/. Plotting, aggregation, and report rendering happen after the run on the local machine, where they’re cheap and parallelizable. The pulled runpod_deploy_pull_manifest.json tells you exactly which run dir to point at.

Why this is a recipe, not a schema feature

runpod-deploy does not run plot scripts or k-fold aggregations. Those are consumer-domain — they belong in your project’s Makefile or Python entry points, where they’re typed and testable.

Pattern (Makefile)

.PHONY: headline-cloud headline-plots
headline-cloud:
	runpod-deploy run --config configs/runpod/headline.yaml

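# Find the most recent run dir and render plots from it.
# Deferred assignment (=, not :=) so the lookup runs when headline-plots
# executes, after headline-cloud has written its new run dir.
LATEST_RUN_DIR = $(shell ls -dt artifacts/runpod/*/ | head -1)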

headline-plots:
	uv run python scripts/render_plots.py --run-dir $(LATEST_RUN_DIR)
	runpod-deploy manifest-summary $(LATEST_RUN_DIR)/runpod_deploy_pull_manifest.json

Running make headline-cloud headline-plots chains the two targets in order (when make runs serially, without -j); run them separately if you want to inspect artifacts before plotting.
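If you would rather resolve the run dir inside a Python entry point than in the Makefile, the same newest-first lookup can be sketched with the standard library. This is a minimal sketch, not part of runpod-deploy: the helper name latest_run_dir is hypothetical, and it assumes every subdirectory of artifacts/runpod/ is a run dir.

```python
from pathlib import Path


def latest_run_dir(root: Path) -> Path:
    """Return the most recently modified run directory under root.

    Hypothetical helper: assumes each subdirectory of root is one run dir.
    """
    run_dirs = [p for p in root.iterdir() if p.is_dir()]
    if not run_dirs:
        raise FileNotFoundError(f"no run directories under {root}")
    # Same ordering as `ls -dt ... | head -1`: newest modification time wins.
    return max(run_dirs, key=lambda p: p.stat().st_mtime)
```

Calling latest_run_dir(Path("artifacts/runpod")) from scripts/render_plots.py would then make the --run-dir flag optional.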

Inspecting what came back

runpod-deploy manifest-summary artifacts/runpod/20260514T120000Z/runpod_deploy_pull_manifest.json

Prints job name, run id, pod id, GPU, datacenter, wall time, captured $/hr price, estimated cost, deploy metadata (git SHA + lockfile hash), per-artifact pull status, and the list of telemetry files. Useful as a “did this run succeed” gate at the top of post-processing scripts.
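The gate can also be written directly in a post-processing script. The sketch below reads the manifest with the standard json module and returns any artifact entries that did not pull cleanly. The manifest schema is not spelled out in this recipe, so the key names artifacts and status and the value ok are assumptions exposed as parameters; check them against a real manifest before relying on this.

```python
import json
from pathlib import Path

MANIFEST_NAME = "runpod_deploy_pull_manifest.json"


def failed_artifacts(
    run_dir: Path,
    *,
    artifacts_key: str = "artifacts",  # assumed key, not a documented schema
    status_key: str = "status",        # assumed key
    ok_value: str = "ok",              # assumed success marker
) -> list[dict]:
    """Return manifest artifact entries whose status differs from ok_value."""
    manifest_path = run_dir / MANIFEST_NAME
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    entries = manifest.get(artifacts_key, [])
    return [entry for entry in entries if entry.get(status_key) != ok_value]
```

A script can call failed_artifacts(run_dir) first and exit with a non-zero status if the returned list is non-empty, so plots are never rendered from a partial pull.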

Forensic deep-dive

Each run dir also contains:

  • run.log — full remote stdout/stderr (always pulled when the run started, even on failure)

  • events.jsonl — orchestrator events (gpu selection, datacenter failover, artifact pull start/complete/fail, pod kill detection, optional __RUNPOD_STEP_*__ markers)

  • metrics.jsonl — periodic GPU/CPU/mem/disk samples (~one row per telemetry.sample_interval_sec)

  • nvidia_smi_{start,end}.txt, pod_describe_{start,end}.json, dmesg_tail.txt, pip_freeze.txt, remote_env.json

Walk these in your own analysis script when you need to reconstruct why a run behaved a certain way.
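As a starting point for such a script, the two .jsonl files can be read line by line with the standard library. The sketch below is illustrative only: read_jsonl and count_events are hypothetical helpers, and the event field used to group rows is an assumed name, since the row schema is not given here.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Iterator


def read_jsonl(path: Path) -> Iterator[dict]:
    """Yield one dict per non-blank line of a JSON Lines file."""
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_events(run_dir: Path, type_key: str = "event") -> Counter:
    """Tally events.jsonl rows by type_key.

    type_key is an assumed field name; rows without it are counted
    under "<missing>" rather than dropped.
    """
    rows = read_jsonl(run_dir / "events.jsonl")
    return Counter(row.get(type_key, "<missing>") for row in rows)
```

The same read_jsonl generator works for metrics.jsonl, for example to load the periodic samples into a list before plotting utilisation over time.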

What lives where

| Concern | Owner |
| --- | --- |
| Pulling remote artifacts to the local run dir | runpod-deploy run (artifacts[*] config + manifest.write_pull_manifest) |
| Capturing events.jsonl, metrics.jsonl, nvidia_smi_*.txt, pod_describe_*.json | runpod-deploy run (telemetry module) |
| Walking the run-dir directory tree | Your driver (or runpod_deploy.forensics.walk_run_dirs) |
| Parsing runpod_deploy_pull_manifest.json | Your driver (or runpod_deploy.forensics.load_manifest) |
| Parsing events.jsonl | Your driver (or runpod_deploy.forensics.load_events) |
| Plotting / aggregation / metrics computation | Your post-processing code (consumer-domain) |
| Joining pulled artifacts with project-wide eval results | Your post-processing code |

Anti-pattern to avoid

Don’t push post-processing onto the pod (e.g. running plotting code inside run.body: or appending pandas aggregation to setup:). The pod is a deployment substrate, not a workflow runner — keep it single-responsibility (training, eval, generation). Local post-processing is fast (no SSH RTT), cheap (no GPU), and trivially re-runnable against pulled artifacts.

If your post-processing legitimately needs GPU compute, that’s a second job — a different runpod-deploy run config with its own artifacts, not a hidden second step inside the primary run.

See also

  • local-preflight-then-run.md — the sibling pattern for pre-run audits; the same Makefile target often wires both.

  • predictions-only-eval.md — the canonical case for pulling per-row outputs and doing all metrics CPU-side.

  • cost-reconciliation.md — reads runpod_deploy_pull_manifest.json for cost/wall-time per run; pair with the artifact analysis above.

  • For aggregate forensics across many runs: runpod-deploy events-query and manifest-summary --root (see troubleshooting.md “Forensic recovery”).