Quickstart#
5 minutes from git clone to your first runpod-deploy run.
This walks through the one-time host setup, the anatomy of a YAML
config, and the --offline-dry-run mental model. For the full pipeline
walk, see lifecycle.md.
0. Prerequisites#
Python ≥ 3.13 (declared in
pyproject.toml’srequires-python).uvfor environment management:curl -LsSf https://astral.sh/uv/install.sh | sh.runpodctl(the official CLI you’re orchestrating around): download from GitHub releases and put on$PATH. Authenticate withrunpodctl config --apiKey <key>.rsync≥ 3.x on your local machine (macOS ships a 2.x by default; install via Homebrew if needed).An SSH key registered with RunPod (paste your public key into the RunPod web UI → Settings → SSH).
1. Install runpod-deploy#
pip install runpod-deploy
Or with uv:
uv pip install runpod-deploy
Verify:
runpod-deploy --help
You should see the subcommand list including validate, run,
gpu-list, manifest-summary, events-query, …
For contributors: see
CONTRIBUTING.mdfor the editable-install flow (git clone+uv pip install -e ".[dev]"), pre-commit setup, and the contribution guidelines.
2. Inspect a working example#
The repo ships a minimal smoke config:
cat examples/smoke/a4000_smoke.yaml
It’s annotated, but the structure is:
schema_version: 2 # required; locks the YAML contract
name: smoke # used as the pod's --name (template-rendered since v0.4.0)
run_id_prefix: smoke
state_file: ~/.runpod-smoke-current # local file recording the pod_id for `stop`
local:
project_root: . # relative to THIS yaml file's directory
required_paths: # validate fails if any are missing locally
- pyproject.toml
pod:
image: runpod/pytorch:2.4.0-py3.13-cuda12.4.1-devel-ubuntu22.04
datacenters: [EU-RO-1] # ordered failover list
gpu_order: # ordered failover list
- NVIDIA RTX A4000
cloud_type: SECURE
# python_version: "3.13.5" # OPTIONAL (v0.5.0); pins via uv
storage:
mode: ephemeral # or network_volume
volume_gb: 20
ssh:
key_path: ~/.ssh/id_ed25519
setup: # commands that run BEFORE staging
- command: "which rsync || (apt-get update && apt-get install -y rsync)"
timeout_sec: 300
- command: "mkdir -p /workspace/smoke"
staging:
- label: source
source: "{project_root}/" # template variables expand
destination: "/workspace/smoke/"
excludes_default: true # v0.4.0 — drops .git, .venv, caches
# excludes_extra: [evals/] # additive on top of defaults
run:
script_path: /workspace/smoke.sh
log_path: /workspace/smoke.log
success_marker: "[smoke] DONE"
body: |
set -euo pipefail
nvidia-smi
echo "[smoke] DONE"
lifecycle:
on_success: delete # release volume disk on success (default)
on_failure: stop # preserve paused for SSH forensics (default)
Every section is documented in config-reference.md.
For the cost trade-offs in the lifecycle: block, see
lifecycle.md §7b.
3. Validate the config#
runpod-deploy validate --config examples/smoke/a4000_smoke.yaml --all
--all enables every check: local-paths exist, GPU availability live
in the datacenter, consumer-pyproject scan, hardcoded-paths scan. The
basic validate (without --all) only checks schema correctness.
4. Walk the pipeline without provisioning#
For an absolute-minimum walkthrough that needs no RunPod account or
SSH key registration, use the bundled hello example:
runpod-deploy run --config examples/hello/hello.yaml --offline-dry-run
Or use the (more realistic) smoke config:
runpod-deploy run --config examples/smoke/a4000_smoke.yaml --offline-dry-run
--offline-dry-run walks the full lifecycle (validate → “provision” →
SSH wait → setup → stage → preflight → launch → monitor → pull →
stop → manifest) but never calls runpodctl, SSH, or rsync.
Each external command is just logged with its argv.
Expected output (abbreviated):
runpodctl pod create --name smoke-<ts> --image runpod/pytorch:... --gpus 1 ...
ssh-detached <pod> 'bash /workspace/smoke.sh' (dry-run)
rsync-push:source {project_root}/ → pod:/workspace/smoke/ (dry-run)
runpodctl pod stop <pod> (dry-run)
This is the unit test of “does my YAML make sense end-to-end” without spending money. Use it constantly during config iteration.
5. Run for real#
runpod-deploy run --config examples/smoke/a4000_smoke.yaml \
--cost-cap-usd 1.0 \
--max-runtime-minutes 15
This will:
runpodctl pod create ...— provision an RTX A4000 in EU-RO-1.Wait for SSH (~30–90s).
Run
setupcommands (install rsync if missing, mkdir).rsyncyour repo to the pod.Run
preflightcommands (none in this example).Launch
/workspace/smoke.shvia detached SSH.Poll the remote log for
[smoke] DONE.Pull
/workspace/smoke.logback toartifacts/runpod/<ts>/.runpodctl pod stop.Write
artifacts/runpod/<ts>/runpod_deploy_pull_manifest.json.
Total time: ~3 minutes. Cost: ~$0.05 at RTX A4000 rates.
If anything goes wrong, the pod is stopped (per lifecycle.on_failure: stop,
the default since v0.8.2) and the failure flow logs a WARNING before
the manifest write. Read troubleshooting.md for
failure-mode diagnostics.
6. Inspect what just happened#
After the run completes:
# List recent runs as a table
runpod-deploy ls-runs
# Print a single manifest summary
runpod-deploy manifest-summary artifacts/runpod/<ts>/runpod_deploy_pull_manifest.json
# Or aggregate every manifest under artifacts/runpod/ with TOTALS
runpod-deploy manifest-summary --root artifacts/runpod
# Event-by-event timeline
runpod-deploy events artifacts/runpod/<ts>
# Aggregate forensic queries across all runs
runpod-deploy events-query --filter event=gpu_selected --since 7d
runpod-deploy events-query --filter event=pod_killed_unexpected --since 30d --json
The manifest captures everything needed for reproducibility: local git
SHA, uv.lock hash, GPU id, datacenter, wall time, estimated cost,
artifact transfer status, and (when pod.python_version is set) the
pinned interpreter version via the auto-injected preflight step.
7. Where to go next#
Your own config: copy
examples/smoke/a4000_smoke.yamlinto your consumer repo atconfigs/runpod/<job>.yaml. The recommended layout haslocal.project_root: ../..so the YAML lives at<your-repo>/configs/runpod/<job>.yamlandproject_rootresolves to the consumer repo’s root. See README “Consumer-owned configs”.Sweeps: drive multiple
runpod-deploy runinvocations from a bash for-loop with--var seed=42to template-substitute per shard. Seerecipes/multi-config-sweep.md.Pipeline patterns: pre-flight checks, post-process aggregation, cost reconciliation, predictions-only-on-GPU, GPU-portability fallbacks. See
recipes/README.md.Library usage (vs CLI): import
from runpod_deploy import run_joband call directly withcli_variables={"seed": "42"}andprint_run_dir=True. Seeextending.md.Failures: when something breaks,
troubleshooting.mdcatalogs the known modes with diagnostics and fixes.Reference: every YAML field documented in
config-reference.md; every CLI flag viarunpod-deploy <subcommand> --help.