Extending runpod-deploy

Three audiences:

  1. Consumers using runpod-deploy from their own repo (the common case). You shouldn’t need to fork or patch — the YAML schema + --var / --vars-file cover almost every config-driven variation, and recipes (docs/recipes/) cover composition with your own pre/post pipeline.

  2. Library users who call the package from their own Python code via from runpod_deploy import run_job (e.g., to wrap it inside a notebook or test).

  3. Contributors adding a new feature, recipe, or PR to this repo.


1. Consumers — the no-fork path

Vary one thing at a time → --var KEY=VALUE

runpod-deploy run --config configs/runpod/template.yaml \
  --var seed=42 --var backbone=deberta --print-run-dir

KEY must be a valid Python identifier. VALUE is any string. Repeat --var to set multiple. CLI --var overrides any matching key in the YAML’s variables: block.
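The KEY=VALUE rule can be sketched with the standard library alone. This is an illustration, not runpod-deploy's own parser: parse_var is a hypothetical helper, and splitting on the first = (so that values may themselves contain =) is an assumption.

```python
from __future__ import annotations


def parse_var(raw: str) -> tuple[str, str]:
    """Split one --var argument into (key, value). Hypothetical helper."""
    # Assumption: only the first "=" separates KEY from VALUE.
    key, sep, value = raw.partition("=")
    if not sep:
        raise ValueError(f"expected KEY=VALUE, got {raw!r}")
    if not key.isidentifier():
        raise ValueError(f"KEY must be a valid Python identifier, got {key!r}")
    return key, value


# Repeating --var builds up a mapping of string keys to string values.
pairs = dict(parse_var(arg) for arg in ["seed=42", "backbone=deberta"])
```

Note that the value stays a string ("42", not 42), matching the "VALUE is any string" rule.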

Vary many things → --vars-file PATH

runpod-deploy run --config configs/runpod/template.yaml \
  --vars-file configs/runpod/sweep_seed42.json

PATH points to a JSON file containing a single object of the form {KEY: VALUE}; every value must be a string. On collision, a CLI --var takes precedence over the matching --vars-file entry.
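The precedence rules can be pictured as a layered dictionary merge. The sketch below is not the library's implementation, and resolve_variables is a hypothetical name. It also assumes that --vars-file entries override the YAML variables: block, which this page does not state; only the "CLI --var wins" rules come from the text above.

```python
from __future__ import annotations

import json


def resolve_variables(
    yaml_vars: dict[str, str],
    vars_file_text: str | None,
    cli_vars: dict[str, str],
) -> dict[str, str]:
    """Merge the three variable sources, lowest precedence first (sketch)."""
    file_vars = json.loads(vars_file_text) if vars_file_text else {}
    if not isinstance(file_vars, dict):
        raise ValueError("vars file must contain a JSON object")
    for key, value in file_vars.items():
        if not isinstance(value, str):
            raise TypeError(f"value for {key!r} must be a string")
    # Later mappings win: YAML block < vars file (assumed) < CLI --var.
    return {**yaml_vars, **file_vars, **cli_vars}


merged = resolve_variables(
    {"seed": "1", "backbone": "bert"},
    '{"seed": "42", "lr": "3e-5"}',
    {"seed": "7"},
)
```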

Per-shard cost cap + max runtime

runpod-deploy run --config foo.yaml --cost-cap-usd 5.0 --max-runtime-minutes 60

Both override the YAML’s budget.cost_cap_usd and budget.max_runtime_minutes.

Pair GPU + DC for one-off runs

runpod-deploy run --config foo.yaml \
  --gpu-id 'NVIDIA RTX 4090' --datacenter-id EU-RO-1

Short-circuits pod.gpu_order and pod.datacenters selection. Both flags must come together (validated at CLI parse time). Useful for “try this GPU class as a smoke test” without editing the YAML.
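A "both or neither" check at parse time might look like the following with argparse. The flag names come from the command above; the parser itself (pair-demo, parse_pair) is a stand-in, not the real CLI code.

```python
from __future__ import annotations

import argparse


def parse_pair(argv: list[str]) -> argparse.Namespace:
    """Parse the two override flags and require them to appear together."""
    parser = argparse.ArgumentParser(prog="pair-demo")  # hypothetical program
    parser.add_argument("--gpu-id")
    parser.add_argument("--datacenter-id")
    args = parser.parse_args(argv)
    # Exactly one flag present is an error; both or neither is fine.
    if (args.gpu_id is None) != (args.datacenter_id is None):
        # parser.error prints usage to stderr and exits with status 2.
        parser.error("--gpu-id and --datacenter-id must be given together")
    return args


args = parse_pair(["--gpu-id", "NVIDIA RTX 4090", "--datacenter-id", "EU-RO-1"])
```

Failing at parse time means a mismatched pair is rejected before any pod is provisioned.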

Filter GPUs by price

runpod-deploy run --config foo.yaml --max-gpu-price 4.50

Skips GPUs above $4.50/hr during the failover loop. Reads RUNPOD_API_KEY for the GraphQL price fetch.
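Conceptually, the price cap is a filter applied to the candidate GPU list before each attempt. The sketch below uses made-up prices and a hypothetical filter_by_price function. How the real failover loop treats GPUs with no known price is not documented here, so keeping them is an assumption.

```python
from __future__ import annotations


def filter_by_price(
    gpu_order: list[str],
    hourly_usd: dict[str, float],
    max_price: float | None,
) -> list[str]:
    """Drop GPUs priced above max_price, preserving the preference order."""
    if max_price is None:
        return list(gpu_order)
    kept = []
    for gpu in gpu_order:
        price = hourly_usd.get(gpu)
        # Assumption: a GPU with no known price is not "above" the cap.
        if price is None or price <= max_price:
            kept.append(gpu)
    return kept


# Prices below are invented for the example.
candidates = filter_by_price(
    ["EXAMPLE-BIG-GPU", "NVIDIA RTX 4090", "EXAMPLE-UNPRICED-GPU"],
    {"EXAMPLE-BIG-GPU": 5.25, "NVIDIA RTX 4090": 0.75},
    4.50,
)
```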


2. Library users — from runpod_deploy import ...

When to use the Python API vs. the CLI

runpod-deploy is primarily a CLI tool. Both known consumers today invoke it exclusively via Makefile (runpod-deploy run --config <yaml>); neither imports the Python package. For most use cases, prefer the CLI — it’s the documented happy path, the rev-rev surface for the maintainers, and the subprocess overhead is negligible against multi-minute GPU pod runtime.

The Python API earns its keep in four specific situations:

| Use case | Why Python beats CLI | Symbols to use |
| --- | --- | --- |
| Multi-manifest forensics | Analyzing N past runs at once with type-checked field access beats hand-rolling json.loads() + path-walking in bash | walk_run_dirs, load_manifest, load_events (see recipes/python-api-for-forensics.md) |
| Type-safe dynamic configs | Programs building or inspecting RunpodJobSpec objects (e.g., a Bayesian optimizer varying gpu_order beyond what --var KEY=VALUE expresses; a CI gate asserting on a loaded YAML) | load_job_spec + the *Spec dataclasses |
| Cost-prediction tooling | Dashboards or CI gates estimating spend before any pod is provisioned; no need for a subprocess just to call the GraphQL pricing layer | fetch_gpu_prices, select_price_for_pod, GpuPrice |
| Embedded orchestration | Calling run_job() from a larger Python platform (e.g., a web UI’s “Deploy” button or a multi-cloud orchestrator routing some jobs to RunPod) | load_job_spec + run_job |

The Python API is not the right tool for:

  • In-process parallel sweeps — subprocess overhead is negligible vs. GPU minutes; the documented multi-config-sweep.md bash pattern with wait -n semaphore is simpler and wins on observability.

  • Direct construction of PodConnection / RemoteRunner / select_gpu_across_datacenters — these are low-level orchestration plumbing surfaces; the orchestrator wraps them. Consumers almost never need to call them directly.

See python-api-vs-cli.md for the full decision criterion and worked examples of each use case.
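As a rough illustration of the cost-prediction use case, a CI gate could bound spend by multiplying an hourly price by the runtime limit and comparing the result against the cost cap. The function names and the price are invented. In real code the hourly figure would come from fetch_gpu_prices, whose GpuPrice fields are not described on this page.

```python
from __future__ import annotations

import math


def worst_case_spend_usd(hourly_usd: float, max_runtime_minutes: int) -> float:
    """Upper bound on spend if the pod runs for the full runtime limit."""
    return hourly_usd * max_runtime_minutes / 60.0


def within_cap(hourly_usd: float, max_runtime_minutes: int, cost_cap_usd: float) -> bool:
    """True when the worst-case spend does not exceed the cap."""
    spend = worst_case_spend_usd(hourly_usd, max_runtime_minutes)
    # isclose guards against float rounding when spend equals the cap.
    return spend <= cost_cap_usd or math.isclose(spend, cost_cap_usd)


# Made-up price: 2.00 USD/hr for 90 minutes.
estimate = worst_case_spend_usd(2.0, 90)
```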

Public API surface

Re-exported from __init__.py:

| Symbol | What it is |
| --- | --- |
| load_job_spec(path) | Parses + validates a YAML config file. Returns a frozen RunpodJobSpec. Raises ValueError / FileNotFoundError / KeyError / TypeError on bad input. |
| build_job_context(spec, config_path, *, cli_variables=None, timestamp=None) | Resolves template variables and computes the run-dir path. Returns a frozen JobContext with render(value) and render_path(value, base=...). |
| validate_local_paths(ctx) | Verifies each local.required_paths entry exists. Raises FileNotFoundError with the missing list. |
| run_job(spec, *, config_path, dry_run=False, offline_dry_run=False, gpu_id_override=None, datacenter_id_override=None, max_gpu_price_usd=None, cli_variables=None, print_run_dir=False) | Full lifecycle (provision → stage → preflight → run → pull → stop → manifest). All side-effecting calls. Raises on failure (after the finally-block manifest write). |
| RunpodJobSpec, JobContext, PodSpec, StorageSpec, RunSpec, … (frozen dataclasses) | Type-hint / pattern-match targets. Constructed via parsers; consumers should treat as read-only. |
| select_gpu_across_datacenters, provision_pod, cleanup_pod, list_stale_pods, bulk_delete_pods | Lower-level orchestration primitives (rarely needed; the orchestrator wraps them). |
| pricing.fetch_gpu_prices(force_refresh=False) | GraphQL pricing fetch. Returns dict[str, GpuPrice]. |
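Because load_job_spec signals bad input with four stdlib exception types, a caller that wants a single error path can catch them as a tuple. The wrapper below is a sketch with a hypothetical name (load_or_explain). It takes the loader as a parameter so the example runs without the package installed; in real use you would pass load_job_spec.

```python
from __future__ import annotations

from typing import Callable, TypeVar

T = TypeVar("T")

# The exception types listed for load_job_spec in the table above.
CONFIG_ERRORS = (ValueError, FileNotFoundError, KeyError, TypeError)


def load_or_explain(loader: Callable[[str], T], path: str) -> tuple[T | None, str | None]:
    """Return (spec, None) on success or (None, message) on a config error."""
    try:
        return loader(path), None
    except CONFIG_ERRORS as exc:
        return None, f"{path}: {type(exc).__name__}: {exc}"


def _fake_loader(path: str) -> dict[str, str]:
    """Stand-in for load_job_spec, used only to exercise the wrapper."""
    if path.endswith("missing.yaml"):
        raise FileNotFoundError("no such config")
    return {"loaded_from": path}


ok_spec, ok_err = load_or_explain(_fake_loader, "configs/demo.yaml")
bad_spec, bad_err = load_or_explain(_fake_loader, "configs/missing.yaml")
```

Any other exception type propagates unchanged, which keeps genuine bugs visible.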

Pattern: composing with a Python driver

from runpod_deploy import load_job_spec, run_job

spec = load_job_spec("configs/runpod/template.yaml")

for seed in [42, 43, 44]:
    run_job(
        spec,
        config_path="configs/runpod/template.yaml",
        cli_variables={"seed": str(seed)},
        print_run_dir=True,
    )

This is the in-process equivalent of the bash sweep recipe. Failures raise; the caller decides retry/skip.

What you should NOT do

  • Do not mutate RunpodJobSpec / JobContext — they’re frozen dataclasses. Construct new instances via dataclasses.replace if you need to alter a field (the CLI does this for budget overrides).

  • Do not import private helpers (anything prefixed with _). They’re not part of the public API and may change without notice.

  • Do not catch RemoteRunError just to swallow it. It is the dedicated exception for “SSH command failed” conditions; if you catch it, you are taking responsibility for a transient pod-side failure, so pair the catch with an explicit retry or skip decision.
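The first rule can be demonstrated with a stand-in frozen dataclass. DemoBudget is not a real runpod-deploy type; its two fields only mirror the budget.cost_cap_usd and budget.max_runtime_minutes keys mentioned earlier.

```python
import dataclasses


@dataclasses.dataclass(frozen=True)
class DemoBudget:
    """Stand-in for a frozen *Spec dataclass (hypothetical)."""

    cost_cap_usd: float = 0.0
    max_runtime_minutes: int = 0


base = DemoBudget(cost_cap_usd=2.0, max_runtime_minutes=30)

# replace() builds a new instance; the original is left untouched.
override = dataclasses.replace(base, cost_cap_usd=5.0)

# Direct assignment on a frozen dataclass raises FrozenInstanceError.
try:
    base.cost_cap_usd = 9.0
    mutation_blocked = False
except dataclasses.FrozenInstanceError:
    mutation_blocked = True
```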


3. Contributors — the SRP boundary

runpod-deploy is a deployment-primitives library. Its single responsibility is the pod lifecycle: GPU/DC selection, staging, remote execution, telemetry, artifact pull, manifest. See CLAUDE.md §1–§16 for the full operational standards.

What we own:

  • Anything pod / GPU / cost / telemetry / manifest.

  • The YAML schema + the validator + the renderer.

  • The forensic CLIs that read manifests + events.

  • Recipes that document composition patterns.

What we don’t own:

  • Consumer-domain logic (audit, plotting, aggregation, metrics).

  • Multi-process workflow orchestration (sweep drivers belong in bash/Make/Python in the consumer repo; see docs/recipes/multi-config-sweep.md).

  • Any feature that would run consumer Python or bash locally.

When you have a feature in mind, ask: is this metadata about the deploy, or is it consumer-domain logic? Former → ship as a schema feature or CLI subcommand. Latter → ship as a recipe under docs/recipes/.

Contributing a recipe

A recipe is a markdown file under docs/recipes/ documenting a composition pattern. Pattern:

  1. Filename — kebab-case, descriptive verb-phrase (local-preflight-then-run.md, cost-reconciliation.md).

  2. Structure — top-level # Recipe: <name> heading; a **Pattern:** one-liner explaining the use case; one or more code blocks showing the implementation; a ## Notes or ## Pitfalls section if non-obvious; a ## See also linking related recipes.

  3. Index — add a one-line entry to docs/recipes/README.md under the appropriate section.

  4. Code blocks must be valid — for runpod-deploy YAML blocks, tests/test_recipe_examples.py (added in v0.7) loads them via load_job_spec. Keep them schema-correct.
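To picture what a recipe-validation test has to do first, here is a sketch that pulls YAML fences out of a markdown string. It is not the code in tests/test_recipe_examples.py, which is not shown on this page; extract_yaml_blocks is a hypothetical helper, and the sample recipe content is illustrative only. Each extracted block would then be written to a temporary file and passed to load_job_spec.

```python
from __future__ import annotations

import re

FENCE = "`" * 3  # built at runtime so this example can itself sit in a fence

# Match a fence opened with "yaml" or "yml" and capture its body.
YAML_BLOCK = re.compile(
    rf"^{FENCE}ya?ml[^\n]*\n(.*?)^{FENCE}[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def extract_yaml_blocks(markdown: str) -> list[str]:
    """Return the body of every YAML code fence, in document order."""
    return [match.group(1) for match in YAML_BLOCK.finditer(markdown)]


sample = "\n".join(
    [
        "# Recipe: demo",
        "",
        FENCE + "yaml",
        "schema_version: 2",
        FENCE,
        "",
        FENCE + "bash",
        "echo not-yaml",
        FENCE,
        "",
    ]
)
blocks = extract_yaml_blocks(sample)
```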

Contributing a schema feature

A schema feature is a new YAML field or top-level section.

  1. Field on a dataclass — add to the appropriate *Spec in src/runpod_deploy/config.py. Provide a sane falsy default (per the additive-change policy in MIGRATION.md).

  2. Validation in __post_init__ — raise stdlib exceptions with diagnostic messages (CLAUDE.md §6).

  3. Parser recognizes the new key — update the matching _parse_* in src/runpod_deploy/_config_parsers.py (add to _check_keys allowed set + extract value).

  4. Use site renders or consumes — pure functions if possible; IO-touching code lives in orchestrator/provider/transport.

  5. Tests — tests/test_config.py for parse + validation; tests/test_orchestrator*.py for use-site behavior (use the --offline-dry-run pattern).

  6. Doc — update docs/config-reference.md with the new field + an example. Update MIGRATION.md “Additive changes since schema_version: 2” with a one-bullet summary.

  7. CHANGELOG — entry under [Unreleased] per Keep-a-Changelog.
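Steps 1 to 3 can be sketched end to end with a made-up field. Everything named here (DemoRetrySpec, max_retries, parse_demo_retry) is hypothetical and only mirrors the shape described above: a falsy default, validation in __post_init__ with stdlib exceptions, and a parser that rejects unknown keys. The real _check_keys helper is not shown on this page.

```python
from __future__ import annotations

import dataclasses


@dataclasses.dataclass(frozen=True)
class DemoRetrySpec:
    """Hypothetical spec with one additive field."""

    max_retries: int = 0  # falsy default keeps existing configs valid

    def __post_init__(self) -> None:
        if self.max_retries < 0:
            raise ValueError(f"max_retries must be >= 0, got {self.max_retries}")


_ALLOWED_KEYS = frozenset({"max_retries"})


def parse_demo_retry(raw: dict[str, object]) -> DemoRetrySpec:
    """Reject unknown keys, then extract the value with its default."""
    unknown = sorted(set(raw) - _ALLOWED_KEYS)
    if unknown:
        raise KeyError(f"unknown keys in retry section: {unknown}")
    value = raw.get("max_retries", 0)
    if not isinstance(value, int):
        raise TypeError(f"max_retries must be an int, got {type(value).__name__}")
    return DemoRetrySpec(max_retries=value)


default_spec = parse_demo_retry({})
custom_spec = parse_demo_retry({"max_retries": 3})
```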

Contributing a CLI subcommand

  1. Subparser — add to _build_parser in src/runpod_deploy/cli.py (after the existing subparsers; sort by ship order, not alphabetic).

  2. Handler — _cmd_<name> function returning an exit code (0 on success, non-zero on failure modes the caller may want to distinguish).

  3. Dispatch — add to the handlers dict in main.

  4. Tests — tests/test_cli_<name>.py. Use capsys for stdout assertions, caplog for logger assertions.

  5. Doc — note in docs/quickstart.md or docs/troubleshooting.md if it’s a forensic tool worth highlighting; in the --help message regardless.

  6. CHANGELOG + MIGRATION — additive CLI changes go under MIGRATION.md “CLI additions” section + CHANGELOG [Unreleased].
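The subparser / handler / dispatch shape from steps 1 to 3 looks roughly like this in argparse. The greet subcommand and the demo-cli program name are invented for the example; only the _build_parser, _cmd_<name>, and handlers-dict naming follows the steps above.

```python
from __future__ import annotations

import argparse


def _build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="demo-cli")  # hypothetical program
    subparsers = parser.add_subparsers(dest="command", required=True)
    greet = subparsers.add_parser("greet", help="print a greeting")
    greet.add_argument("name")
    return parser


def _cmd_greet(args: argparse.Namespace) -> int:
    """Handler: return 0 on success, 1 when the name is empty."""
    if not args.name:
        return 1
    print(f"hello, {args.name}")
    return 0


def main(argv: list[str] | None = None) -> int:
    args = _build_parser().parse_args(argv)
    # Dispatch table: subcommand name -> handler returning an exit code.
    handlers = {"greet": _cmd_greet}
    return handlers[args.command](args)


exit_code = main(["greet", "pod"])
```

Returning the exit code from main, rather than calling sys.exit inside handlers, lets tests assert on it directly.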

Coding standards

See CLAUDE.md (the canonical doc at repo root). Summary:

  • Black, line length 100. ruff for linting. mypy strict.

  • Frozen slotted dataclasses for all value types.

  • Stdlib exceptions; no Result[T, Error] patterns.

  • from __future__ import annotations at the top of every module except __init__.py.

  • Module-level docstring on every module; one-line docstring on every __all__ function.

  • 80%+ unit coverage; coverage gate at floor(actual) - 5, bumped per-PR.

  • Real tests; no stubs / TODOs / placeholders.

Updating golden CLI files

The v0.7 release introduces golden-file snapshot tests for CLI output stability. When you intentionally change a CLI’s output format:

pytest tests/test_cli_golden.py --update-goldens
git diff tests/fixtures/golden/   # eyeball the diff
git add tests/fixtures/golden/*.txt

Review the diff carefully — these files lock the UX contract.
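For readers new to golden-file testing, the core mechanism is a compare-or-overwrite helper. The sketch below is not the code in tests/test_cli_golden.py, and check_golden is a hypothetical name. It only illustrates what a flag like --update-goldens typically toggles.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def check_golden(actual: str, golden_path: Path, update: bool = False) -> bool:
    """Compare output with the stored snapshot, or rewrite it when updating."""
    if update:
        golden_path.parent.mkdir(parents=True, exist_ok=True)
        golden_path.write_text(actual, encoding="utf-8")
        return True
    return golden_path.read_text(encoding="utf-8") == actual


# Demo in a temporary directory so no real fixtures are touched.
with tempfile.TemporaryDirectory() as tmp:
    golden = Path(tmp) / "golden" / "help.txt"
    wrote = check_golden("usage: demo\n", golden, update=True)
    same = check_golden("usage: demo\n", golden)
    drifted = check_golden("usage: demo [-h]\n", golden)
```

In update mode the helper always passes, which is why the git diff review step matters.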