Extending runpod-deploy
Three audiences:
Consumers using runpod-deploy from their own repo (the common case). You shouldn’t need to fork or patch: the YAML schema + --var / --vars-file cover almost every config-driven variation, and recipes (docs/recipes/) cover composition with your own pre/post pipeline.
Library users importing from runpod_deploy import run_job into Python code (e.g., to wrap it inside a notebook or test).
Contributors adding a new feature, recipe, or PR to this repo.
1. Consumers: the no-fork path
Vary one thing at a time → --var KEY=VALUE
runpod-deploy run --config configs/runpod/template.yaml \
--var seed=42 --var backbone=deberta --print-run-dir
KEY must be a valid Python identifier. VALUE is any string. Repeat
--var to set multiple. CLI --var overrides any matching key in
the YAML’s variables: block.
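The rules above can be sketched in a few lines of Python. This is an illustration only, not the tool's actual parser: `parse_var` and `merge_variables` are hypothetical names, and splitting on the first `=` is an assumption about how values containing `=` are handled.

```python
def parse_var(arg: str) -> tuple[str, str]:
    """Split one KEY=VALUE argument (hypothetical helper, not the real CLI code)."""
    key, sep, value = arg.partition("=")  # assumption: split on the first "=" only
    if not sep:
        raise ValueError(f"expected KEY=VALUE, got {arg!r}")
    if not key.isidentifier():
        raise ValueError(f"KEY must be a valid Python identifier, got {key!r}")
    return key, value  # VALUE stays a plain string, even if it looks numeric


def merge_variables(yaml_variables: dict[str, str], cli_args: list[str]) -> dict[str, str]:
    """Apply repeated --var arguments on top of the YAML variables: block."""
    merged = dict(yaml_variables)
    for arg in cli_args:
        key, value = parse_var(arg)
        merged[key] = value  # the CLI value wins on collision
    return merged
```

For example, `merge_variables({"seed": "1"}, ["seed=42"])` yields `{"seed": "42"}`, with the seed kept as a string.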
Vary many things → --vars-file PATH
runpod-deploy run --config configs/runpod/template.yaml \
--vars-file configs/runpod/sweep_seed42.json
PATH is a JSON file holding one object of the form {"KEY": "VALUE"}; all values must be strings. CLI --var
takes precedence over --vars-file entries on collision.
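A minimal sketch of the layering, assuming that --vars-file entries also override the YAML variables: block (the text above only states that --var wins over both). `load_vars_file` and `resolve_variables` are hypothetical names, not part of the package.

```python
import json
from pathlib import Path


def load_vars_file(path: str) -> dict[str, str]:
    """Read a vars file and enforce the 'JSON object of strings' shape."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        raise ValueError("vars file must contain a JSON object")
    for key, value in data.items():
        if not isinstance(value, str):
            raise ValueError(f"value for {key!r} must be a string, got {type(value).__name__}")
    return data


def resolve_variables(
    yaml_vars: dict[str, str],
    file_vars: dict[str, str],
    cli_vars: dict[str, str],
) -> dict[str, str]:
    """Later sources win. Assumed order: YAML < --vars-file < --var."""
    return {**yaml_vars, **file_vars, **cli_vars}
```

Keeping the merge as a pure function makes the precedence easy to unit-test without touching a pod.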
Per-shard cost cap + max runtime
runpod-deploy run --config foo.yaml --cost-cap-usd 5.0 --max-runtime-minutes 60
Both override the YAML’s budget.cost_cap_usd and
budget.max_runtime_minutes.
Pair GPU + DC for one-off runs
runpod-deploy run --config foo.yaml \
--gpu-id 'NVIDIA RTX 4090' --datacenter-id EU-RO-1
Short-circuits pod.gpu_order and pod.datacenters selection. Both
flags must come together (validated at CLI parse time). Useful for
“try this GPU class as a smoke test” without editing the YAML.
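The "both or neither" rule is a standard argparse pattern. The sketch below shows one way to enforce it at parse time; it is not the real `cli.py`, and the parser name and error wording are made up for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Stand-in parser with only the two flags under discussion.
    parser = argparse.ArgumentParser(prog="runpod-deploy-sketch")
    parser.add_argument("--gpu-id")
    parser.add_argument("--datacenter-id")
    return parser


def parse_args(argv: list[str]) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # Reject the case where exactly one of the two flags was supplied.
    if (args.gpu_id is None) != (args.datacenter_id is None):
        parser.error("--gpu-id and --datacenter-id must be given together")
    return args
```

`parser.error` prints the usage line and exits with status 2, so a half-specified pair fails before any pod work starts.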
Filter GPUs by price
runpod-deploy run --config foo.yaml --max-gpu-price 4.50
Skips GPUs above $4.50/hr during the failover loop. Reads
RUNPOD_API_KEY for the GraphQL price fetch.
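Conceptually the price cap is a filter over the candidate list. The sketch below simplifies by filtering up front rather than inside the failover loop; `GpuOffer` is a hypothetical stand-in type, and any prices passed to it are illustrative, not real quotes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuOffer:
    """Hypothetical stand-in for one (GPU, hourly price) candidate."""

    gpu_id: str
    price_usd_per_hour: float


def filter_by_price(offers: list[GpuOffer], max_price: float) -> list[GpuOffer]:
    """Keep offers at or below the cap, preserving the original preference order."""
    return [offer for offer in offers if offer.price_usd_per_hour <= max_price]
```

Preserving order matters here: the cap should remove candidates, not reshuffle the preference order expressed in pod.gpu_order.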
2. Library users: from runpod_deploy import ...
When to use the Python API vs. the CLI
runpod-deploy is primarily a CLI tool. Both known consumers today
invoke it exclusively via Makefile (runpod-deploy run --config <yaml>); neither imports the Python package. For most use cases,
prefer the CLI — it’s the documented happy path, the rev-rev
surface for the maintainers, and the subprocess overhead is
negligible against multi-minute GPU pod runtime.
The Python API earns its keep in four specific situations:
| Use case | Why Python beats CLI | Symbols to use |
|---|---|---|
| Multi-manifest forensics | Analyzing N past runs at once with type-checked field access beats hand-rolling | |
| Type-safe dynamic configs | Programs building or inspecting | |
| Cost-prediction tooling | Dashboards or CI gates estimating spend before any pod is provisioned; no need for a subprocess just to call the GraphQL pricing layer | |
| Embedded orchestration | Calling | |
The Python API is not the right tool for:
In-process parallel sweeps: subprocess overhead is negligible vs. GPU minutes; the documented multi-config-sweep.md bash pattern with wait -n semaphore is simpler and wins on observability.
Direct construction of PodConnection / RemoteRunner / select_gpu_across_datacenters: these are low-level orchestration plumbing surfaces; the orchestrator wraps them. Consumers almost never need to call them directly.
See python-api-vs-cli.md for the full
decision criterion and worked examples of each use case.
Public API surface
Re-exported from __init__.py:
| Symbol | What it is |
|---|---|
| | Parses + validates a YAML config file. Returns a frozen |
| | Resolves template variables and computes the run-dir path. Returns a frozen |
| | Verifies each |
| | Full lifecycle (provision → stage → preflight → run → pull → stop → manifest). All side-effecting calls. Raises on failure (after the |
| | Type-hint / pattern-match targets. Constructed via parsers; consumers should treat as read-only. |
| | Lower-level orchestration primitives (rarely needed; the orchestrator wraps them). |
| | GraphQL pricing fetch. Returns |
Pattern: composing with a Python driver
from runpod_deploy import load_job_spec, run_job

spec = load_job_spec("configs/runpod/template.yaml")
for seed in [42, 43, 44]:
    # Each call raises on failure; wrap it if you want retry/skip behavior.
    run_job(
        spec,
        config_path="configs/runpod/template.yaml",
        cli_variables={"seed": str(seed)},
        print_run_dir=True,
    )
This is the in-process equivalent of the bash sweep recipe. Failures raise; the caller decides retry/skip.
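Because failures raise, the retry/skip policy lives in the driver. One possible shape is sketched below; `sweep` is a hypothetical helper that takes any callable, so it can be exercised without provisioning a pod. The broad `except Exception` is a simplification, and narrowing it to the errors you consider retryable is usually the better choice.

```python
from typing import Callable


def sweep(
    run_one: Callable[[int], None],
    seeds: list[int],
    max_attempts: int = 2,
) -> dict[int, str]:
    """Run each seed, retrying on failure, and record one outcome per seed."""
    outcomes: dict[int, str] = {}
    for seed in seeds:
        for attempt in range(1, max_attempts + 1):
            try:
                run_one(seed)
            except Exception as exc:  # broad on purpose in this sketch
                outcomes[seed] = f"failed after {attempt} attempt(s): {exc}"
            else:
                outcomes[seed] = "ok"
                break  # success: move on to the next seed
    return outcomes
```

In a real driver, `run_one` would wrap the `run_job(...)` call with the seed passed through `cli_variables`; a failed seed is recorded and skipped rather than aborting the whole sweep.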
What you should NOT do
Do not mutate RunpodJobSpec / JobContext: they’re frozen dataclasses. Construct new instances via dataclasses.replace if you need to alter a field (the CLI does this for budget overrides).
Do not import private helpers (anything prefixed with _). They’re not part of the public API and may change without notice.
Do not catch RemoteRunError to swallow it: it’s the carve-out for “SSH command failed” semantics. Catching means you’ve decided to handle a transient pod-side failure, which is your retry decision to make.
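The `dataclasses.replace` pattern looks like this on stand-in types. `SpecSketch` and `BudgetSketch` are hypothetical and only mirror the budget.cost_cap_usd / budget.max_runtime_minutes fields named earlier; the real spec classes have more fields and their own validation.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class BudgetSketch:
    cost_cap_usd: float
    max_runtime_minutes: int


@dataclass(frozen=True)
class SpecSketch:
    name: str
    budget: BudgetSketch


spec = SpecSketch(name="demo", budget=BudgetSketch(cost_cap_usd=10.0, max_runtime_minutes=120))

# Build a new spec with an overridden budget; the original object is untouched.
capped = replace(spec, budget=replace(spec.budget, cost_cap_usd=5.0))

# Direct mutation of a frozen dataclass raises instead of silently succeeding.
try:
    spec.budget.cost_cap_usd = 5.0  # type: ignore[misc]
except FrozenInstanceError:
    mutated = False
else:
    mutated = True
```

Note that `replace` re-runs `__post_init__`, so any validation on the real spec classes would apply to the new instance as well.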
3. Contributors: the SRP boundary
runpod-deploy is a deployment-primitives library. Its single responsibility is the pod lifecycle: GPU/DC selection, staging, remote execution, telemetry, artifact pull, manifest. See CLAUDE.md §1–§16 for the full operational standards.
What we own:
Anything pod / GPU / cost / telemetry / manifest.
The YAML schema + the validator + the renderer.
The forensic CLIs that read manifests + events.
Recipes that document composition patterns.
What we don’t own:
Consumer-domain logic (audit, plotting, aggregation, metrics).
Multi-process workflow orchestration (sweep drivers belong in bash/Make/Python in the consumer repo; see docs/recipes/multi-config-sweep.md).
Any feature that would run consumer Python or bash locally.
When you have a feature in mind, ask: is this metadata about the
deploy, or is it consumer-domain logic? Former → ship as a schema
feature or CLI subcommand. Latter → ship as a recipe under
docs/recipes/.
Contributing a recipe
A recipe is a markdown file under docs/recipes/ documenting a
composition pattern. Pattern:
Filename: kebab-case, descriptive verb-phrase (local-preflight-then-run.md, cost-reconciliation.md).
Structure: top-level # Recipe: <name> heading; a **Pattern:** one-liner explaining the use case; one or more code blocks showing the implementation; a ## Notes or ## Pitfalls section if non-obvious; a ## See also section linking related recipes.
Index: add a one-line entry to docs/recipes/README.md under the appropriate section.
Code blocks must be valid: for runpod-deploy YAML blocks, tests/test_recipe_examples.py (added in v0.7) loads them via load_job_spec. Keep them schema-correct.
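To check a recipe locally before opening a PR, something like the following can pull the YAML blocks out of a markdown file. This is a guess at the general approach, not the contents of tests/test_recipe_examples.py; `extract_yaml_blocks` is a hypothetical helper, and it only extracts text, leaving the `load_job_spec` call to the caller.

```python
import re

# Built at runtime so this example can itself sit inside a fenced block.
FENCE = "`" * 3

# A fence line tagged yaml/yml, then everything up to the next bare fence line.
YAML_BLOCK = re.compile(
    rf"^{FENCE}ya?ml[^\n]*\n(.*?)^{FENCE}\s*$",
    re.DOTALL | re.MULTILINE,
)


def extract_yaml_blocks(markdown: str) -> list[str]:
    """Return the body of every yaml-tagged fenced block, in document order."""
    return [match.group(1) for match in YAML_BLOCK.finditer(markdown)]
```

Each returned string could then be written to a temporary file and passed to `load_job_spec`, which takes a path in the driver example earlier in this document.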
Contributing a schema feature
A schema feature is a new YAML field or top-level section.
Field on a dataclass: add to the appropriate *Spec in src/runpod_deploy/config.py. Provide a sane falsy default (per the additive-change policy in MIGRATION.md).
Validation in __post_init__: raise stdlib exceptions with diagnostic messages (CLAUDE.md §6).
Parser recognizes the new key: update the matching _parse_* in src/runpod_deploy/_config_parsers.py (add to the _check_keys allowed set + extract value).
Use site renders or consumes: pure functions if possible; IO-touching code lives in orchestrator/provider/transport.
Tests: tests/test_config.py for parse + validation; tests/test_orchestrator*.py for use-site behavior (use the --offline-dry-run pattern).
Doc: update docs/config-reference.md with the new field + an example. Update MIGRATION.md “Additive changes since schema_version: 2” with a one-bullet summary.
CHANGELOG: entry under [Unreleased] per Keep-a-Changelog.
Contributing a CLI subcommand
Subparser: add to _build_parser in src/runpod_deploy/cli.py (after the existing subparsers; sort by ship order, not alphabetic).
Handler: _cmd_<name> function returning an exit code (0 on success, non-zero on failure-modes the caller may want to distinguish).
Dispatch: add to the handlers dict in main.
Tests: tests/test_cli_<name>.py. Use capsys for stdout assertions, caplog for logger assertions.
Doc: note in docs/quickstart.md or docs/troubleshooting.md if it’s a forensic tool worth highlighting; in the --help message regardless.
CHANGELOG + MIGRATION: additive CLI changes go under the MIGRATION.md “CLI additions” section + CHANGELOG [Unreleased].
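The subparser / handler / dispatch trio fits in a few lines. This is a self-contained imitation of the layout described above, not a copy of cli.py: the `hello` subcommand and its `--name` flag are invented for the example, and the exit code 2 for bad input is an arbitrary choice.

```python
import argparse
from typing import Callable, Optional, Sequence


def _cmd_hello(args: argparse.Namespace) -> int:
    """Handler: do the work and return an exit code."""
    if not args.name:
        print("error: --name must not be empty")
        return 2  # distinct non-zero code for a caller-visible failure mode
    print(f"hello, {args.name}")
    return 0


def _build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="cli-sketch")
    sub = parser.add_subparsers(dest="command", required=True)
    hello = sub.add_parser("hello", help="Print a greeting (illustrative only).")
    hello.add_argument("--name", default="world")
    return parser


def main(argv: Optional[Sequence[str]] = None) -> int:
    args = _build_parser().parse_args(argv)
    # Dispatch table: subcommand name -> handler.
    handlers: dict[str, Callable[[argparse.Namespace], int]] = {"hello": _cmd_hello}
    return handlers[args.command](args)
```

Returning the exit code from `main` instead of calling `sys.exit` inside handlers keeps each handler directly testable with capsys.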
Coding standards
See CLAUDE.md (the canonical doc at repo root). Summary:
Black, line length 100. ruff for linting. mypy strict.
Frozen slotted dataclasses for all value types.
Stdlib exceptions; no Result[T, Error] patterns.
from __future__ import annotations at the top of every module except __init__.py.
Module-level docstring on every module; one-line docstring on every __all__ function.
80%+ unit coverage; coverage gate at floor(actual) − 5, bumped per-PR.
Real tests; no stubs / TODOs / placeholders.
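One reading of the coverage-gate formula, written out as arithmetic. `coverage_gate` is a hypothetical helper, and how the gate is actually configured in CI is not shown here.

```python
import math


def coverage_gate(actual_percent: float, margin: int = 5) -> int:
    """Gate threshold: measured coverage rounded down, minus a fixed margin."""
    return math.floor(actual_percent) - margin
```

With measured coverage of 87.6%, the gate would sit at 82; a PR that lifts coverage to 90.2% would bump it to 85.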
Updating golden CLI files
The v0.7 release introduces golden-file snapshot tests for CLI output stability. When you intentionally change a CLI’s output format:
pytest tests/test_cli_golden.py --update-goldens
git diff tests/fixtures/golden/ # eyeball the diff
git add tests/fixtures/golden/*.txt
Review the diff carefully — these files lock the UX contract.
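For readers new to snapshot testing, the compare-or-rewrite logic behind an update flag can be as small as the sketch below. `check_golden` is a hypothetical helper; how the real tests wire --update-goldens into pytest is not documented here (a custom option registered through pytest's `pytest_addoption` hook is one common approach).

```python
from pathlib import Path


def check_golden(golden_path: Path, actual: str, update: bool = False) -> bool:
    """Return True when `actual` matches the snapshot; rewrite it when `update` is set."""
    if update:
        golden_path.parent.mkdir(parents=True, exist_ok=True)
        golden_path.write_text(actual, encoding="utf-8")
        return True
    # A missing snapshot raises FileNotFoundError, which should fail the test loudly.
    return golden_path.read_text(encoding="utf-8") == actual
```

In update mode the function always passes, which is why the git diff review step above is the real safeguard.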