Examples

Minimal, focused worked examples, one concept per file. Each is written to run end-to-end under myst-nb: cells execute during `sphinx-build`, and their outputs render inline in the generated HTML.

By capability

| Example | Demonstrates | Minimum extras |
| --- | --- | --- |
| Metrics + bootstrap | `pr_auc`, `roc_auc`, `brier_score`, `bootstrap_ci` (BCa / percentile) | none |
| Evaluate harness | `evaluate` orchestrator, `write_run_result`, schema validation | `[dataframe]` |
| Calibration | Platt + isotonic recalibration, ECE before/after | none |
| Leakage detection | Exact-duplicate, normalized-form, and label-conflict checks | `[dataframe]` |
| Claims + gates | `EvidenceGate` composition for release decisions | `[dataframe]` |
| Paired comparison | `paired_bootstrap_diff`, MDE for two-scorer comparisons | none |
| Prompt-injection walkthrough | Full pipeline on synthetic OWASP fixtures | `[dataframe]` |
| PyTorch scorer | Wrapping a PyTorch model as a `Scorer` (execution skipped in CI) | `[dataframe]`, `torch` |
| Nested seed-split | LODO k-fold × multi-seed × stratified train/val composition | none |
| Callable embedder for dedup | `EmbeddingCosineStrategy` with `make_minilm_embedder` / custom embedders | `[embeddings]` (optional) |
| Cross-corpus contamination scan | `pairs_across` for benign-vs-injection contamination flagging | none |
| `plot_roc_curve` walkthrough | ROC rendering with threshold marker + baseline overlay | `[plotting]` |
| `plot_pareto_frontier` walkthrough | Cost-vs-performance scatter with frontier overlay | `[plotting]` |
| `plot_slice_metric_heatmap` walkthrough | `(row × col → metric)` grid with colorbar + annotations | `[plotting]` |
| OOD manifest loader | `ood_dataset_from_manifest`: declarative loader for multiple OOD slates with sha256 caching | `[dataframe]`, `[yaml]`, `[parquet]` |
| Character-injection sweep | `eval_toolkit.adversarial`: six character-level techniques + Scorer-Protocol sweep for adversarial robustness | `[dataframe]` |
| ActivationDeltaProbe | `eval_toolkit.probes.ActivationDeltaProbe`: TaskTracker-style linear probe on transformer activation deltas | `[probes]` for real backbones; mocked illustration here |
| Spotlighting variants | `eval_toolkit.preprocessing`: delimit / datamark / encode structural defenses + batch sweep | none |
| RecallAtLowFPR loss | `eval_toolkit.losses.RecallAtLowFPR`: Meta Prompt Guard 2 training recipe (differentiable recall-at-fixed-FPR) | `[losses]`; static render in docs CI |
| LogisticStacker | `eval_toolkit.stacking.LogisticStacker`: combine multiple detector outputs into a calibrated meta-classifier via the `MetaLearner` Protocol | none |
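
To give a flavour of the first row without depending on the library, here is a minimal standard-library sketch of a Brier score with a percentile bootstrap interval. It does not use the library's `brier_score` or `bootstrap_ci`, whose signatures are not shown on this page; the helper names, the toy data, and the defaults below are hypothetical, and the BCa variant is not covered.

```python
import random


def brier(y_true, y_prob):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def percentile_bootstrap_ci(y_true, y_prob, metric, n_resamples=2000, alpha=0.05, seed=0):
    """Resample rows with replacement and read the interval off the empirical quantiles.

    Hypothetical helper for illustration only; not the library's `bootstrap_ci`.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_prob[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Toy data, made up for this sketch.
y_true = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
y_prob = [0.1, 0.3, 0.8, 0.6, 0.2, 0.9, 0.4, 0.7, 0.55, 0.35]

point = brier(y_true, y_prob)
ci_low, ci_high = percentile_bootstrap_ci(y_true, y_prob, brier)
print(f"Brier = {point:.4f}, 95% CI = [{ci_low:.4f}, {ci_high:.4f}]")
```

The linked example pages remain the reference for the real API; this block only illustrates the resampling idea behind a percentile interval.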

How these run

Since v0.38.0, examples are myst-nb notebooks: Markdown source with `{code-cell}` directives. Cells execute during `sphinx-build` with `nb_execution_mode = "cache"`, so a page is re-executed only when its source changes. Cell outputs (printed text, tables, figures) render inline in the published HTML, so the docs site reflects the actual library behavior.
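
As a rough sketch, the relevant part of a Sphinx `conf.py` for this setup might look like the fragment below. Only `nb_execution_mode = "cache"` is stated in the text above; the extension list, the timeout, and the error policy are assumptions for illustration, not the project's actual configuration.

```python
# conf.py (illustrative fragment, not the project's actual file)

# Assumed: myst-nb is enabled as a Sphinx extension.
extensions = ["myst_nb"]

# Stated above: execute notebooks, reusing cached outputs while the source is unchanged.
nb_execution_mode = "cache"

# Assumptions for illustration: per-cell timeout in seconds, and a failing
# build when any cell raises, so broken examples do not ship silently.
nb_execution_timeout = 120
nb_execution_raise_on_error = True
```

With the cache mode, a docs rebuild that touches no example source reuses the stored outputs instead of re-running every cell.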

Two pages have execution disabled at page level because they require optional dependencies that aren’t in `[dev]`:

`pytorch_scorer_example.md` needs `torch` (~700 MB including transitive dependencies)
`callable_embedder_dedup.md` needs `[embeddings]` (sentence-transformers)

These pages render their code statically.
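
The page-level switch lives in each page's own source and is not reproduced here. As an alternative sketch at the `conf.py` level, myst-nb's `nb_execution_excludepatterns` option could skip those pages only when the optional module is missing. The `OPTIONAL_PAGES` mapping is hypothetical, and this is not necessarily how the project does it.

```python
# conf.py (illustrative fragment): skip execution only when a dependency is absent.
import importlib.util

# Hypothetical mapping from example page to the top-level module it imports.
OPTIONAL_PAGES = {
    "pytorch_scorer_example.md": "torch",
    "callable_embedder_dedup.md": "sentence_transformers",
}

# find_spec returns None for a top-level module that is not installed,
# without importing it, so this check stays cheap even for torch.
nb_execution_excludepatterns = [
    page
    for page, module in OPTIONAL_PAGES.items()
    if importlib.util.find_spec(module) is None
]
```

Pages matched by the exclude patterns are still rendered, just without executed outputs, which is the same static behavior described above.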