# Quickstart 5 minutes from `git clone` to your first `runpod-deploy run`. This walks through the one-time host setup, the anatomy of a YAML config, and the `--offline-dry-run` mental model. For the full pipeline walk, see [`lifecycle.md`](lifecycle.md). --- ## 0. Prerequisites - **Python ≥ 3.13** (declared in `pyproject.toml`'s `requires-python`). - **`uv`** for environment management: `curl -LsSf https://astral.sh/uv/install.sh | sh`. - **`runpodctl`** (the official CLI you're orchestrating around): download from [GitHub releases](https://github.com/runpod/runpodctl/releases) and put on `$PATH`. Authenticate with `runpodctl config --apiKey `. - **`rsync`** ≥ 3.x on your local machine (macOS ships a 2.x by default; install via Homebrew if needed). - An SSH key registered with RunPod (paste your public key into the RunPod web UI → Settings → SSH). --- ## 1. Install runpod-deploy ```bash pip install runpod-deploy ``` Or with `uv`: ```bash uv pip install runpod-deploy ``` Verify: ```bash runpod-deploy --help ``` You should see the subcommand list including `validate`, `run`, `gpu-list`, `manifest-summary`, `events-query`, ... > **For contributors**: see [`CONTRIBUTING.md`](https://github.com/brandon-behring/runpod-deploy/blob/main/CONTRIBUTING.md) for > the editable-install flow (`git clone` + `uv pip install -e ".[dev]"`), > pre-commit setup, and the contribution guidelines. --- ## 2. Inspect a working example The repo ships a minimal smoke config: ```bash cat examples/smoke/a4000_smoke.yaml ``` It's annotated, but the structure is: ```yaml schema_version: 2 # required; locks the YAML contract name: smoke # used as the pod's --name (template-rendered since v0.4.0) run_id_prefix: smoke state_file: ~/.runpod-smoke-current # local file recording the pod_id for `stop` local: project_root: . # relative to THIS yaml file's directory required_paths: # validate fails if any are missing locally - pyproject.toml pod: image: runpod/pytorch:2.4.0-py3.13-cuda12.4.1-devel-ubuntu22.04 datacenters: [EU-RO-1] # ordered failover list gpu_order: # ordered failover list - NVIDIA RTX A4000 cloud_type: SECURE # python_version: "3.13.5" # OPTIONAL (v0.5.0); pins via uv storage: mode: ephemeral # or network_volume volume_gb: 20 ssh: key_path: ~/.ssh/id_ed25519 setup: # commands that run BEFORE staging - command: "which rsync || (apt-get update && apt-get install -y rsync)" timeout_sec: 300 - command: "mkdir -p /workspace/smoke" staging: - label: source source: "{project_root}/" # template variables expand destination: "/workspace/smoke/" excludes_default: true # v0.4.0 — drops .git, .venv, caches # excludes_extra: [evals/] # additive on top of defaults run: script_path: /workspace/smoke.sh log_path: /workspace/smoke.log success_marker: "[smoke] DONE" body: | set -euo pipefail nvidia-smi echo "[smoke] DONE" lifecycle: on_success: delete # release volume disk on success (default) on_failure: stop # preserve paused for SSH forensics (default) ``` Every section is documented in [`config-reference.md`](config-reference.md). For the cost trade-offs in the `lifecycle:` block, see [`lifecycle.md` §7b](lifecycle.md#7b-cost-discipline-cleaning-up-after-forensics). --- ## 3. Validate the config ```bash runpod-deploy validate --config examples/smoke/a4000_smoke.yaml --all ``` `--all` enables every check: local-paths exist, GPU availability live in the datacenter, consumer-pyproject scan, hardcoded-paths scan. The basic `validate` (without `--all`) only checks schema correctness. --- ## 4. Walk the pipeline without provisioning For an **absolute-minimum** walkthrough that needs no RunPod account or SSH key registration, use the bundled `hello` example: ```bash runpod-deploy run --config examples/hello/hello.yaml --offline-dry-run ``` Or use the (more realistic) smoke config: ```bash runpod-deploy run --config examples/smoke/a4000_smoke.yaml --offline-dry-run ``` `--offline-dry-run` walks the full lifecycle (validate → "provision" → SSH wait → setup → stage → preflight → launch → monitor → pull → stop → manifest) but **never** calls `runpodctl`, SSH, or `rsync`. Each external command is just logged with its argv. Expected output (abbreviated): ``` runpodctl pod create --name smoke- --image runpod/pytorch:... --gpus 1 ... ssh-detached 'bash /workspace/smoke.sh' (dry-run) rsync-push:source {project_root}/ → pod:/workspace/smoke/ (dry-run) runpodctl pod stop (dry-run) ``` This is the unit test of "does my YAML make sense end-to-end" without spending money. Use it constantly during config iteration. --- ## 5. Run for real ```bash runpod-deploy run --config examples/smoke/a4000_smoke.yaml \ --cost-cap-usd 1.0 \ --max-runtime-minutes 15 ``` This will: 1. `runpodctl pod create ...` — provision an RTX A4000 in EU-RO-1. 2. Wait for SSH (~30–90s). 3. Run `setup` commands (install rsync if missing, mkdir). 4. `rsync` your repo to the pod. 5. Run `preflight` commands (none in this example). 6. Launch `/workspace/smoke.sh` via detached SSH. 7. Poll the remote log for `[smoke] DONE`. 8. Pull `/workspace/smoke.log` back to `artifacts/runpod//`. 9. `runpodctl pod stop`. 10. Write `artifacts/runpod//runpod_deploy_pull_manifest.json`. Total time: ~3 minutes. Cost: ~$0.05 at RTX A4000 rates. If anything goes wrong, the pod is stopped (per `lifecycle.on_failure: stop`, the default since v0.8.2) and the failure flow logs a WARNING before the manifest write. Read [`troubleshooting.md`](troubleshooting.md) for failure-mode diagnostics. --- ## 6. Inspect what just happened After the run completes: ```bash # List recent runs as a table runpod-deploy ls-runs # Print a single manifest summary runpod-deploy manifest-summary artifacts/runpod//runpod_deploy_pull_manifest.json # Or aggregate every manifest under artifacts/runpod/ with TOTALS runpod-deploy manifest-summary --root artifacts/runpod # Event-by-event timeline runpod-deploy events artifacts/runpod/ # Aggregate forensic queries across all runs runpod-deploy events-query --filter event=gpu_selected --since 7d runpod-deploy events-query --filter event=pod_killed_unexpected --since 30d --json ``` The manifest captures everything needed for reproducibility: local git SHA, `uv.lock` hash, GPU id, datacenter, wall time, estimated cost, artifact transfer status, and (when `pod.python_version` is set) the pinned interpreter version via the auto-injected preflight step. --- ## 7. Where to go next - **Your own config**: copy `examples/smoke/a4000_smoke.yaml` into your consumer repo at `configs/runpod/.yaml`. The recommended layout has `local.project_root: ../..` so the YAML lives at `/configs/runpod/.yaml` and `project_root` resolves to the consumer repo's root. See README "Consumer-owned configs". - **Sweeps**: drive multiple `runpod-deploy run` invocations from a bash for-loop with `--var seed=42` to template-substitute per shard. See [`recipes/multi-config-sweep.md`](recipes/multi-config-sweep.md). - **Pipeline patterns**: pre-flight checks, post-process aggregation, cost reconciliation, predictions-only-on-GPU, GPU-portability fallbacks. See [`recipes/README.md`](recipes/README.md). - **Library usage** (vs CLI): import `from runpod_deploy import run_job` and call directly with `cli_variables={"seed": "42"}` and `print_run_dir=True`. See [`extending.md`](extending.md). - **Failures**: when something breaks, [`troubleshooting.md`](troubleshooting.md) catalogs the known modes with diagnostics and fixes. - **Reference**: every YAML field documented in [`config-reference.md`](config-reference.md); every CLI flag via `runpod-deploy --help`.