# Extending runpod-deploy

Three audiences:

1. **Consumers** using runpod-deploy from their own repo (the common case). You shouldn't need to fork or patch — the YAML schema + `--var` / `--vars-file` cover almost every config-driven variation, and recipes (`docs/recipes/`) cover composition with your own pre/post pipeline.
2. **Library users** importing `from runpod_deploy import run_job` into Python code (e.g., to wrap it inside a notebook or test).
3. **Contributors** adding a new feature, recipe, or PR to this repo.

---

## 1. Consumers — the no-fork path

### Vary one thing at a time → `--var KEY=VALUE`

```sh
runpod-deploy run --config configs/runpod/template.yaml \
    --var seed=42 --var backbone=deberta --print-run-dir
```

KEY must be a valid Python identifier. VALUE is any string. Repeat `--var` to set multiple. CLI `--var` overrides any matching key in the YAML's `variables:` block.

### Vary many things → `--vars-file PATH`

```sh
runpod-deploy run --config configs/runpod/template.yaml \
    --vars-file configs/runpod/sweep_seed42.json
```

PATH is a JSON object `{KEY: VALUE}` (all string values). CLI `--var` takes precedence over `--vars-file` entries on collision.

### Per-shard cost cap + max runtime

```sh
runpod-deploy run --config foo.yaml --cost-cap-usd 5.0 --max-runtime-minutes 60
```

Both override the YAML's `budget.cost_cap_usd` and `budget.max_runtime_minutes`.

### Pair GPU + DC for one-off runs

```sh
runpod-deploy run --config foo.yaml \
    --gpu-id 'NVIDIA RTX 4090' --datacenter-id EU-RO-1
```

Short-circuits `pod.gpu_order` and `pod.datacenters` selection. Both flags must come together (validated at CLI parse time). Useful for "try this GPU class as a smoke test" without editing the YAML.

### Filter GPUs by price

```sh
runpod-deploy run --config foo.yaml --max-gpu-price 4.50
```

Skips GPUs above $4.50/hr during the failover loop. Reads `RUNPOD_API_KEY` for the GraphQL price fetch.

---

## 2. Library users — `from runpod_deploy import ...`

### When to use the Python API vs. the CLI

`runpod-deploy` is primarily a CLI tool. Both known consumers today invoke it exclusively via Makefile (`runpod-deploy run --config `); neither imports the Python package. For most use cases, **prefer the CLI** — it's the documented happy path, the rev-rev surface for the maintainers, and the subprocess overhead is negligible against multi-minute GPU pod runtime.

The Python API earns its keep in four specific situations:

| Use case | Why Python beats CLI | Symbols to use |
|---|---|---|
| **Multi-manifest forensics** | Analyzing N past runs at once with type-checked field access beats hand-rolling `json.loads()` + path-walking in bash | `walk_run_dirs`, `load_manifest`, `load_events` (see [recipes/python-api-for-forensics.md](recipes/python-api-for-forensics.md)) |
| **Type-safe dynamic configs** | Programs building or inspecting `RunpodJobSpec` objects (e.g., a Bayesian optimizer varying `gpu_order` beyond what `--var KEY=VALUE` expresses; a CI gate asserting on a loaded YAML) | `load_job_spec` + the `*Spec` dataclasses |
| **Cost-prediction tooling** | Dashboards or CI gates estimating spend before any pod is provisioned; no need for a subprocess just to call the GraphQL pricing layer | `fetch_gpu_prices`, `select_price_for_pod`, `GpuPrice` |
| **Embedded orchestration** | Calling `run_job()` from a larger Python platform (e.g., a web UI's "Deploy" button or a multi-cloud orchestrator routing some jobs to RunPod) | `load_job_spec` + `run_job` |

The Python API is **not** the right tool for:

- *In-process parallel sweeps* — subprocess overhead is negligible vs. GPU minutes; the documented [`multi-config-sweep.md`](recipes/multi-config-sweep.md) bash pattern with `wait -n` semaphore is simpler and wins on observability.
- *Direct construction of `PodConnection` / `RemoteRunner` / `select_gpu_across_datacenters`* — these are low-level orchestration plumbing surfaces; the orchestrator wraps them. Consumers almost never need to call them directly.

See [`python-api-vs-cli.md`](python-api-vs-cli.md) for the full decision criterion and worked examples of each use case.

### Public API surface

Re-exported from `__init__.py`:

| Symbol | What it is |
|---|---|
| `load_job_spec(path)` | Parses + validates a YAML config file. Returns a frozen `RunpodJobSpec`. Raises `ValueError` / `FileNotFoundError` / `KeyError` / `TypeError` on bad input. |
| `build_job_context(spec, config_path, *, cli_variables=None, timestamp=None)` | Resolves template variables and computes the run-dir path. Returns a frozen `JobContext` with `render(value)` and `render_path(value, base=...)`. |
| `validate_local_paths(ctx)` | Verifies each `local.required_paths` entry exists. Raises `FileNotFoundError` with the missing list. |
| `run_job(spec, *, config_path, dry_run=False, offline_dry_run=False, gpu_id_override=None, datacenter_id_override=None, max_gpu_price_usd=None, cli_variables=None, print_run_dir=False)` | Full lifecycle (provision → stage → preflight → run → pull → stop → manifest). All side-effecting calls. Raises on failure (after the `finally`-block manifest write). |
| `RunpodJobSpec`, `JobContext`, `PodSpec`, `StorageSpec`, `RunSpec`, ... (frozen dataclasses) | Type-hint / pattern-match targets. Constructed via parsers; consumers should treat as read-only. |
| `select_gpu_across_datacenters`, `provision_pod`, `cleanup_pod`, `list_stale_pods`, `bulk_delete_pods` | Lower-level orchestration primitives (rarely needed; the orchestrator wraps them). |
| `pricing.fetch_gpu_prices(force_refresh=False)` | GraphQL pricing fetch. Returns `dict[str, GpuPrice]`. |

### Pattern: composing with a Python driver

```python notest
from runpod_deploy import load_job_spec, run_job

spec = load_job_spec("configs/runpod/template.yaml")
for seed in [42, 43, 44]:
    run_job(
        spec,
        config_path="configs/runpod/template.yaml",
        cli_variables={"seed": str(seed)},
        print_run_dir=True,
    )
```

This is the in-process equivalent of the bash sweep recipe. Failures raise; the caller decides retry/skip.

### What you should NOT do

- **Do not mutate `RunpodJobSpec` / `JobContext`** — they're frozen dataclasses. Construct new instances via `dataclasses.replace` if you need to alter a field (the CLI does this for budget overrides).
- **Do not import private helpers** (anything prefixed with `_`). They're not part of the public API and may change without notice.
- **Do not catch `RemoteRunError` to swallow it** — it's the carve-out for "SSH command failed" semantics. Catching means you've decided to handle a transient pod-side failure, which is your retry decision to make.

---

## 3. Contributors — the SRP boundary

runpod-deploy is a **deployment-primitives library**. Its single responsibility is the pod lifecycle: GPU/DC selection, staging, remote execution, telemetry, artifact pull, manifest. See CLAUDE.md §1–§16 for the full operational standards.

**What we own**:

- Anything pod / GPU / cost / telemetry / manifest.
- The YAML schema + the validator + the renderer.
- The forensic CLIs that read manifests + events.
- Recipes that document composition patterns.

**What we don't own**:

- Consumer-domain logic (audit, plotting, aggregation, metrics).
- Multi-process workflow orchestration (sweep drivers belong in bash/Make/Python in the consumer repo; see `docs/recipes/multi-config-sweep.md`).
- Any feature that would run consumer Python or bash locally.

When you have a feature in mind, ask: *is this metadata about the deploy, or is it consumer-domain logic?* Former → ship as a schema feature or CLI subcommand. Latter → ship as a recipe under `docs/recipes/`.
### Contributing a recipe

A recipe is a markdown file under `docs/recipes/` documenting a composition pattern. Pattern:

1. **Filename** — kebab-case, descriptive verb-phrase (`local-preflight-then-run.md`, `cost-reconciliation.md`).
2. **Structure** — top-level `# Recipe: ` heading; a `**Pattern:**` one-liner explaining the use case; one or more code blocks showing the implementation; a `## Notes` or `## Pitfalls` section if non-obvious; a `## See also` linking related recipes.
3. **Index** — add a one-line entry to `docs/recipes/README.md` under the appropriate section.
4. **Code blocks must be valid** — for runpod-deploy YAML blocks, `tests/test_recipe_examples.py` (added in v0.7) loads them via `load_job_spec`. Keep them schema-correct.

### Contributing a schema feature

A schema feature is a new YAML field or top-level section.

1. **Field on a dataclass** — add to the appropriate `*Spec` in `src/runpod_deploy/config.py`. Provide a sane falsy default (per the additive-change policy in `MIGRATION.md`).
2. **Validation in `__post_init__`** — raise stdlib exceptions with diagnostic messages (CLAUDE.md §6).
3. **Parser recognizes the new key** — update the matching `_parse_*` in `src/runpod_deploy/_config_parsers.py` (add to `_check_keys` allowed set + extract value).
4. **Use site renders or consumes** — pure functions if possible; IO-touching code lives in orchestrator/provider/transport.
5. **Tests** — `tests/test_config.py` for parse + validation; `tests/test_orchestrator*.py` for use-site behavior (use the `--offline-dry-run` pattern).
6. **Doc** — update `docs/config-reference.md` with the new field + an example. Update `MIGRATION.md` "Additive changes since schema_version: 2" with a one-bullet summary.
7. **CHANGELOG** — entry under `[Unreleased]` per Keep-a-Changelog.

### Contributing a CLI subcommand

1. **Subparser** — add to `_build_parser` in `src/runpod_deploy/cli.py` (after the existing subparsers; sort by ship order, not alphabetic).
2. **Handler** — `_cmd_` function returning an exit code (0 on success, non-zero on failure-modes the caller may want to distinguish).
3. **Dispatch** — add to the `handlers` dict in `main`.
4. **Tests** — `tests/test_cli_.py`. Use `capsys` for stdout assertions, `caplog` for logger assertions.
5. **Doc** — note in `docs/quickstart.md` or `docs/troubleshooting.md` if it's a forensic tool worth highlighting; in the `--help` message regardless.
6. **CHANGELOG + MIGRATION** — additive CLI changes go under `MIGRATION.md` "CLI additions" section + CHANGELOG `[Unreleased]`.

### Coding standards

See CLAUDE.md (the canonical doc at repo root). Summary:

- Black, line length 100. ruff for linting. mypy strict.
- Frozen slotted dataclasses for all value types.
- Stdlib exceptions; no `Result[T, Error]` patterns.
- `from __future__ import annotations` at the top of every module except `__init__.py`.
- Module-level docstring on every module; one-line docstring on every `__all__` function.
- 80%+ unit coverage; coverage gate at `floor(actual) − 5`, bumped per-PR.
- Real tests; no stubs / TODOs / placeholders.

### Updating golden CLI files

The v0.7 release introduces golden-file snapshot tests for CLI output stability. When you intentionally change a CLI's output format:

```sh
pytest tests/test_cli_golden.py --update-goldens
git diff tests/fixtures/golden/   # eyeball the diff
git add tests/fixtures/golden/*.txt
```

Review the diff carefully — these files lock the UX contract.
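
For readers new to golden files, the compare-or-update idea can be sketched in a few lines. This is an illustration only: `check_golden`, its `update` parameter, and the sample output strings are hypothetical, and the repo's actual `--update-goldens` wiring may work differently.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def check_golden(actual: str, golden_path: Path, *, update: bool = False) -> bool:
    """Compare `actual` with the golden file, or rewrite the file when `update` is set.

    Hypothetical helper for illustration; not the repo's test harness.
    """
    if update:
        golden_path.parent.mkdir(parents=True, exist_ok=True)
        golden_path.write_text(actual, encoding="utf-8")
        return True
    return golden_path.read_text(encoding="utf-8") == actual


with tempfile.TemporaryDirectory() as tmp:
    golden = Path(tmp) / "golden" / "status.txt"
    # A run in update mode records the snapshot.
    recorded = check_golden("pod: RUNNING\n", golden, update=True)
    # Unchanged output matches the recorded snapshot.
    unchanged_ok = check_golden("pod: RUNNING\n", golden)
    # A format change is flagged until the golden is regenerated.
    changed_ok = check_golden("pod=RUNNING\n", golden)
```

The comparison is an exact string match, which is why even a small, intentional format change shows up in `git diff` and needs a reviewed regeneration.
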