Extending runpod-deploy#

Three audiences:

Consumers using runpod-deploy from their own repo (the common case). You shouldn’t need to fork or patch — the YAML schema + --var / --vars-file cover almost every config-driven variation, and recipes (docs/recipes/) cover composition with your own pre/post pipeline.
Library users who write from runpod_deploy import run_job in their own Python code (e.g., to wrap it inside a notebook or test).
Contributors adding a new feature, recipe, or PR to this repo.

1. Consumers — the no-fork path#

Vary one thing at a time → `--var KEY=VALUE`#

runpod-deploy run --config configs/runpod/template.yaml \
  --var seed=42 --var backbone=deberta --print-run-dir

KEY must be a valid Python identifier. VALUE is any string. Repeat --var to set multiple. CLI --var overrides any matching key in the YAML’s variables: block.
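
The KEY/VALUE rules above can be sketched in a few lines. This is an illustration only, not the parser runpod-deploy ships: `parse_var` and `collect_vars` are hypothetical names, and the "last repeat wins" behavior for a duplicated KEY is an assumption the text does not confirm.

```python
def parse_var(arg: str) -> tuple[str, str]:
    """Split one KEY=VALUE argument (hypothetical helper, not the real parser)."""
    # Split on the first "=" only, so VALUE may itself contain "=".
    key, sep, value = arg.partition("=")
    if not sep:
        raise ValueError(f"expected KEY=VALUE, got {arg!r}")
    if not key.isidentifier():
        raise ValueError(f"KEY must be a valid Python identifier, got {key!r}")
    return key, value


def collect_vars(args: list[str]) -> dict[str, str]:
    """Fold repeated --var arguments into one mapping.

    Assumption: when the same KEY is repeated, the last occurrence wins.
    """
    return dict(parse_var(arg) for arg in args)
```

Note that `str.isidentifier()` also accepts Python keywords such as `class`; whether the real validator rejects those is not stated here.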

Vary many things → `--vars-file PATH`#

runpod-deploy run --config configs/runpod/template.yaml \
  --vars-file configs/runpod/sweep_seed42.json

PATH points to a JSON file containing a single object {KEY: VALUE} whose values are all strings. CLI --var takes precedence over --vars-file entries on collision.
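
A minimal sketch of the precedence rules, assuming a three-layer merge. The text confirms that `--var` beats both the YAML `variables:` block and `--vars-file`; placing `--vars-file` above the YAML block is an assumption. `load_vars_file` and `merge_variables` are hypothetical names.

```python
import json
from pathlib import Path


def load_vars_file(path: str) -> dict[str, str]:
    """Read a vars file and enforce the 'JSON object of strings' shape."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        raise TypeError(f"{path}: expected a JSON object at the top level")
    non_strings = sorted(k for k, v in data.items() if not isinstance(v, str))
    if non_strings:
        raise TypeError(f"{path}: values must be strings, offending keys: {non_strings}")
    return data


def merge_variables(
    yaml_vars: dict[str, str],
    file_vars: dict[str, str],
    cli_vars: dict[str, str],
) -> dict[str, str]:
    """Later layers override earlier ones.

    Assumed order, lowest to highest priority:
    YAML variables: block, then --vars-file, then --var.
    """
    return {**yaml_vars, **file_vars, **cli_vars}
```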

Per-shard cost cap + max runtime#

runpod-deploy run --config foo.yaml --cost-cap-usd 5.0 --max-runtime-minutes 60

Both override the YAML’s budget.cost_cap_usd and budget.max_runtime_minutes.

Pair GPU + DC for one-off runs#

runpod-deploy run --config foo.yaml \
  --gpu-id 'NVIDIA RTX 4090' --datacenter-id EU-RO-1

Short-circuits pod.gpu_order and pod.datacenters selection. Both flags must be passed together (validated at CLI parse time). Useful for “try this GPU class as a smoke test” without editing the YAML.
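
A "both or neither" check at parse time can be sketched with the standard `argparse` module. Only the two flag names come from this page; the parser layout and the error message are illustrative and may differ from the real CLI.

```python
import argparse


def parse_args(argv: list[str]) -> argparse.Namespace:
    """Parse the two pairing flags and reject a lone one (illustrative only)."""
    parser = argparse.ArgumentParser(prog="pairing-sketch")
    parser.add_argument("--gpu-id")
    parser.add_argument("--datacenter-id")
    args = parser.parse_args(argv)

    # Exactly one of the two being set is the invalid case.
    if (args.gpu_id is None) != (args.datacenter_id is None):
        # parser.error prints a usage message and exits with status 2.
        parser.error("--gpu-id and --datacenter-id must be passed together")
    return args
```

Failing at parse time means a lone flag is rejected before any config is loaded or any pod is requested.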

Filter GPUs by price#

runpod-deploy run --config foo.yaml --max-gpu-price 4.50

Skips GPUs above $4.50/hr during the failover loop. Reads RUNPOD_API_KEY for the GraphQL price fetch.
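
The price filter can be pictured as a pass over the candidate list. This sketch uses plain floats for prices because the fields of `GpuPrice` are not shown on this page. `filter_by_price` is a hypothetical name, and keeping GPUs whose price is unknown is an assumption.

```python
def filter_by_price(
    gpu_order: list[str],
    usd_per_hour: dict[str, float],
    max_price: float,
) -> list[str]:
    """Return candidates in their original order, minus those priced above the cap."""
    kept = []
    for gpu in gpu_order:
        price = usd_per_hour.get(gpu)
        # "Above" is read strictly: a GPU priced exactly at the cap is kept.
        if price is not None and price > max_price:
            continue
        # Assumption: a GPU with no known price is kept rather than dropped.
        kept.append(gpu)
    return kept
```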

2. Library users — `from runpod_deploy import ...`#

When to use the Python API vs. the CLI#

runpod-deploy is primarily a CLI tool. Both known consumers today invoke it exclusively via Makefile (runpod-deploy run --config <yaml>); neither imports the Python package. For most use cases, prefer the CLI — it’s the documented happy path, the rev-rev surface for the maintainers, and the subprocess overhead is negligible against multi-minute GPU pod runtime.

The Python API earns its keep in four specific situations:

| Use case | Why Python beats CLI | Symbols to use |
| --- | --- | --- |
| Multi-manifest forensics | Analyzing N past runs at once with type-checked field access beats hand-rolling `json.loads()` + path-walking in bash | `walk_run_dirs`, `load_manifest`, `load_events` (see recipes/python-api-for-forensics.md) |
| Type-safe dynamic configs | Programs building or inspecting `RunpodJobSpec` objects (e.g., a Bayesian optimizer varying `gpu_order` beyond what `--var KEY=VALUE` expresses; a CI gate asserting on a loaded YAML) | `load_job_spec` + the `*Spec` dataclasses |
| Cost-prediction tooling | Dashboards or CI gates estimating spend before any pod is provisioned; no need for a subprocess just to call the GraphQL pricing layer | `fetch_gpu_prices`, `select_price_for_pod`, `GpuPrice` |
| Embedded orchestration | Calling `run_job()` from a larger Python platform (e.g., a web UI’s “Deploy” button or a multi-cloud orchestrator routing some jobs to RunPod) | `load_job_spec` + `run_job` |

The Python API is not the right tool for:

In-process parallel sweeps — subprocess overhead is negligible vs. GPU minutes; the documented multi-config-sweep.md bash pattern with wait -n semaphore is simpler and wins on observability.
Direct construction of PodConnection / RemoteRunner / select_gpu_across_datacenters — these are low-level orchestration plumbing surfaces; the orchestrator wraps them. Consumers almost never need to call them directly.

See python-api-vs-cli.md for the full decision criterion and worked examples of each use case.

Public API surface#

Re-exported from __init__.py:

| Symbol | What it is |
| --- | --- |
| `load_job_spec(path)` | Parses + validates a YAML config file. Returns a frozen `RunpodJobSpec`. Raises `ValueError` / `FileNotFoundError` / `KeyError` / `TypeError` on bad input. |
| `build_job_context(spec, config_path, *, cli_variables=None, timestamp=None)` | Resolves template variables and computes the run-dir path. Returns a frozen `JobContext` with `render(value)` and `render_path(value, base=...)`. |
| `validate_local_paths(ctx)` | Verifies each `local.required_paths` entry exists. Raises `FileNotFoundError` with the missing list. |
| `run_job(spec, *, config_path, dry_run=False, offline_dry_run=False, gpu_id_override=None, datacenter_id_override=None, max_gpu_price_usd=None, cli_variables=None, print_run_dir=False)` | Full lifecycle (provision → stage → preflight → run → pull → stop → manifest). All side-effecting calls. Raises on failure (after the `finally`-block manifest write). |
| `RunpodJobSpec`, `JobContext`, `PodSpec`, `StorageSpec`, `RunSpec`, … (frozen dataclasses) | Type-hint / pattern-match targets. Constructed via parsers; consumers should treat as read-only. |
| `select_gpu_across_datacenters`, `provision_pod`, `cleanup_pod`, `list_stale_pods`, `bulk_delete_pods` | Lower-level orchestration primitives (rarely needed; the orchestrator wraps them). |
| `pricing.fetch_gpu_prices(force_refresh=False)` | GraphQL pricing fetch. Returns `dict[str, GpuPrice]`. |
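
As a rough illustration of how the first three entries chain together, the sketch below loads a config, resolves variables, and checks local paths without calling `run_job`. The `preflight` wrapper is hypothetical. It assumes `build_job_context` accepts a plain string for `config_path`, as the `run_job` example on this page does, and it wraps load errors in `RuntimeError` purely as a choice made for this sketch.

```python
from __future__ import annotations


def preflight(config_path: str, variables: dict[str, str] | None = None):
    """Load a config, resolve variables, and check local required paths."""
    # Lazy import: keeps this module importable where runpod_deploy is absent.
    from runpod_deploy import build_job_context, load_job_spec, validate_local_paths

    try:
        spec = load_job_spec(config_path)
    except (ValueError, FileNotFoundError, KeyError, TypeError) as exc:
        # These are the four exception types documented for bad input.
        raise RuntimeError(f"cannot load {config_path}: {exc!r}") from exc

    # Assumption: a plain string path is accepted here.
    ctx = build_job_context(spec, config_path, cli_variables=variables)
    validate_local_paths(ctx)  # raises FileNotFoundError if a required path is missing
    return ctx
```

A CI gate could call `preflight("configs/runpod/template.yaml")` and fail the build on any exception.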

Pattern: composing with a Python driver#

from runpod_deploy import load_job_spec, run_job

spec = load_job_spec("configs/runpod/template.yaml")

for seed in [42, 43, 44]:
    run_job(
        spec,
        config_path="configs/runpod/template.yaml",
        cli_variables={"seed": str(seed)},
        print_run_dir=True,
    )

This is the in-process equivalent of the bash sweep recipe. Failures raise; the caller decides retry/skip.
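
One way a caller might implement the retry/skip decision is a small wrapper that takes the per-seed call as a parameter. `sweep` and `max_attempts` are hypothetical names. The wrapper is generic, so it makes no assumption about which exception types `run_job` raises.

```python
from typing import Callable, Iterable


def sweep(
    run_one: Callable[[int], None],
    seeds: Iterable[int],
    max_attempts: int = 2,
) -> dict[int, Exception]:
    """Run each seed sequentially; return the seeds that failed every attempt."""
    failed: dict[int, Exception] = {}
    for seed in seeds:
        for attempt in range(1, max_attempts + 1):
            try:
                run_one(seed)
            except Exception as exc:  # the caller owns the retry/skip policy
                if attempt == max_attempts:
                    failed[seed] = exc  # give up on this seed, move to the next
            else:
                break  # success: no further attempts for this seed
    return failed
```

With the driver above, `run_one` could be `lambda seed: run_job(spec, config_path="configs/runpod/template.yaml", cli_variables={"seed": str(seed)})`. Each attempt repeats the full `run_job` lifecycle, so a retry budget is worth choosing with cost in mind.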

What you should NOT do#

Do not mutate RunpodJobSpec / JobContext — they’re frozen dataclasses. Construct new instances via dataclasses.replace if you need to alter a field (the CLI does this for budget overrides).
Do not import private helpers (anything prefixed with _). They’re not part of the public API and may change without notice.
Do not catch RemoteRunError just to swallow it. It is the carve-out for “SSH command failed” semantics, so catching it means you have taken on the retry decision for a transient pod-side failure; handle it deliberately or let it propagate.

3. Contributors — the SRP boundary#

runpod-deploy is a deployment-primitives library. Its single responsibility is the pod lifecycle: GPU/DC selection, staging, remote execution, telemetry, artifact pull, manifest. See CLAUDE.md §1–§16 for the full operational standards.

What we own:

Anything pod / GPU / cost / telemetry / manifest.
The YAML schema + the validator + the renderer.
The forensic CLIs that read manifests + events.
Recipes that document composition patterns.

What we don’t own:

Consumer-domain logic (audit, plotting, aggregation, metrics).
Multi-process workflow orchestration (sweep drivers belong in bash/Make/Python in the consumer repo; see docs/recipes/multi-config-sweep.md).
Any feature that would run consumer Python or bash locally.

When you have a feature in mind, ask: is this metadata about the deploy, or is it consumer-domain logic? Former → ship as a schema feature or CLI subcommand. Latter → ship as a recipe under docs/recipes/.

Contributing a recipe#

A recipe is a markdown file under docs/recipes/ documenting a composition pattern. Pattern:

Filename — kebab-case, descriptive verb-phrase (local-preflight-then-run.md, cost-reconciliation.md).
Structure — top-level # Recipe: <name> heading; a **Pattern:** one-liner explaining the use case; one or more code blocks showing the implementation; a ## Notes or ## Pitfalls section if non-obvious; a ## See also linking related recipes.
Index — add a one-line entry to docs/recipes/README.md under the appropriate section.
Code blocks must be valid — for runpod-deploy YAML blocks, tests/test_recipe_examples.py (added in v0.7) loads them via load_job_spec. Keep them schema-correct.
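
To make the "code blocks must be valid" rule concrete, here is a sketch of how YAML blocks could be pulled out of a recipe before being checked. It is not the contents of tests/test_recipe_examples.py; `yaml_blocks` is a hypothetical helper, and the fence-matching rule is a simplification.

```python
import re

# Built indirectly so this sketch can itself sit inside a fenced block.
TICKS = "`" * 3

# A yaml/yml fence line, the body (captured), then a closing fence on its own line.
FENCE = re.compile(
    rf"^{TICKS}ya?ml[^\n]*\n(.*?)^{TICKS}[ \t]*$",
    re.MULTILINE | re.DOTALL,
)


def yaml_blocks(markdown: str) -> list[str]:
    """Return the body of every fenced yaml/yml block, in document order."""
    return [match.group(1) for match in FENCE.finditer(markdown)]
```

Since `load_job_spec` takes a path, a test built on this would presumably write each extracted block to a temporary file before loading it.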

Contributing a schema feature#

A schema feature is a new YAML field or top-level section.

Field on a dataclass — add to the appropriate *Spec in src/runpod_deploy/config.py. Provide a sane falsy default (per the additive-change policy in MIGRATION.md).
Validation in __post_init__ — raise stdlib exceptions with diagnostic messages (CLAUDE.md §6).
Parser recognizes the new key — update the matching _parse_* in src/runpod_deploy/_config_parsers.py (add to _check_keys allowed set + extract value).
Use site renders or consumes — pure functions if possible; IO-touching code lives in orchestrator/provider/transport.
Tests — tests/test_config.py for parse + validation; tests/test_orchestrator*.py for use-site behavior (use the --offline-dry-run pattern).
Doc — update docs/config-reference.md with the new field and an example. Update MIGRATION.md “Additive changes since schema_version: 2” with a one-bullet summary.
CHANGELOG — entry under [Unreleased] per Keep-a-Changelog.

Contributing a CLI subcommand#

Subparser — add to _build_parser in src/runpod_deploy/cli.py (after the existing subparsers; sort by ship order, not alphabetic).
Handler — _cmd_<name> function returning an exit code (0 on success, non-zero on failure-modes the caller may want to distinguish).
Dispatch — add to the handlers dict in main.
Tests — tests/test_cli_<name>.py. Use capsys for stdout assertions, caplog for logger assertions.
Doc — describe it in the --help message in every case; also note it in docs/quickstart.md or docs/troubleshooting.md if it’s a forensic tool worth highlighting.
CHANGELOG + MIGRATION — additive CLI changes go under MIGRATION.md “CLI additions” section + CHANGELOG [Unreleased].
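
The subparser, handler, and dispatch steps follow a common `argparse` shape, sketched below with an invented `greet` subcommand. The function names echo the ones listed above, but the bodies are illustrative and are not copied from cli.py.

```python
from __future__ import annotations

import argparse


def _cmd_greet(args: argparse.Namespace) -> int:
    """Handler for the hypothetical 'greet' subcommand; returns an exit code."""
    if not args.name:
        return 1  # distinct non-zero code for a failure the caller may check
    print(f"hello {args.name}")
    return 0


def _build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="cli-sketch")
    sub = parser.add_subparsers(dest="command", required=True)
    # New subparsers are appended after the existing ones.
    greet = sub.add_parser("greet", help="print a greeting (example only)")
    greet.add_argument("--name", default="")
    return parser


def main(argv: list[str] | None = None) -> int:
    args = _build_parser().parse_args(argv)
    handlers = {"greet": _cmd_greet}  # dispatch table keyed by subcommand name
    return handlers[args.command](args)
```

Returning an integer from `main` keeps the handler easy to test: a test can call `main([...])` directly and assert on the code and on captured stdout.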

Coding standards#

See CLAUDE.md (the canonical doc at repo root). Summary:

Black, line length 100. ruff for linting. mypy strict.
Frozen slotted dataclasses for all value types.
Stdlib exceptions; no Result[T, Error] patterns.
from __future__ import annotations at the top of every module except __init__.py.
Module-level docstring on every module; one-line docstring on every __all__ function.
80%+ unit coverage; coverage gate at floor(actual) − 5, bumped per-PR.
Real tests; no stubs / TODOs / placeholders.

Updating golden CLI files#

The v0.7 release introduces golden-file snapshot tests for CLI output stability. When you intentionally change a CLI’s output format:

pytest tests/test_cli_golden.py --update-goldens
git diff tests/fixtures/golden/   # eyeball the diff
git add tests/fixtures/golden/*.txt

Review the diff carefully — these files lock the UX contract.
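
For readers new to golden-file testing, the core comparison can be sketched in a few lines. `check_golden` is a hypothetical helper. How tests/test_cli_golden.py actually wires up `--update-goldens` is not shown on this page and is not reproduced here.

```python
from pathlib import Path


def check_golden(actual: str, golden_path: Path, update: bool = False) -> bool:
    """Compare output against a stored snapshot, or rewrite the snapshot.

    With update=True the golden file is overwritten and the check passes;
    otherwise the stored text must match the actual output exactly.
    """
    if update:
        golden_path.parent.mkdir(parents=True, exist_ok=True)
        golden_path.write_text(actual, encoding="utf-8")
        return True
    return golden_path.read_text(encoding="utf-8") == actual
```

Because update mode always passes, the review step moves to the `git diff`, which is why the diff deserves a careful read.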