Examples

Minimal, focused worked examples, one concept per file. Each is written to run end-to-end under myst-nb: cells execute during sphinx-build and their outputs render inline in the published HTML.

By capability

| Example | Demonstrates | Minimum extras |
| --- | --- | --- |
| Metrics + bootstrap | pr_auc, roc_auc, brier_score, bootstrap_ci (BCa / percentile) | none |
| Evaluate harness | evaluate orchestrator, write_run_result, schema validation | [dataframe] |
| Calibration | Platt + isotonic recalibration, ECE before/after | none |
| Leakage detection | Exact dupe, normalized-form, label-conflict checks | [dataframe] |
| Claims + gates | EvidenceGate composition for release decisions | [dataframe] |
| Paired comparison | paired_bootstrap_diff, MDE for two-scorer comparisons | none |
| Prompt-injection walkthrough | Full pipeline on synthetic OWASP fixtures | [dataframe] |
| PyTorch scorer | Wrapping a PyTorch model as a Scorer (skip-execed in CI) | [dataframe], torch |
| Nested seed-split | LODO k-fold × multi-seed × stratified train/val composition | none |
| Callable embedder for dedup | EmbeddingCosineStrategy with make_minilm_embedder / custom embedders | [embeddings] (optional) |
| Cross-corpus contamination scan | pairs_across for benign-vs-injection contamination flagging | none |
| plot_roc_curve walkthrough | ROC rendering with threshold marker + baseline overlay | [plotting] |
| plot_pareto_frontier walkthrough | Cost-vs-performance scatter with frontier overlay | [plotting] |
| plot_slice_metric_heatmap walkthrough | (row × col metric) grid with colorbar + annotations | [plotting] |
| OOD manifest loader | ood_dataset_from_manifest: declarative loader for multiple OOD slates with sha256 caching | [dataframe], [yaml], [parquet] |
| Character-injection sweep | eval_toolkit.adversarial: six character-level techniques + Scorer-Protocol sweep for adversarial robustness | [dataframe] |
| ActivationDeltaProbe | eval_toolkit.probes.ActivationDeltaProbe: TaskTracker-style linear probe on transformer activation deltas | [probes] for real backbones; mocked illustration here |
| Spotlighting variants | eval_toolkit.preprocessing: delimit / datamark / encode structural defenses + batch sweep | none |
| RecallAtLowFPR loss | eval_toolkit.losses.RecallAtLowFPR: Meta Prompt Guard 2 training recipe (differentiable recall-at-fixed-FPR) | [losses]; static render in docs CI |

How these run

Since v0.38.0, examples are myst-nb notebooks (Markdown source with {code-cell} directives). Cells execute during sphinx-build with nb_execution_mode = "cache" — re-execution is triggered only when the source page changes. Cell outputs (printed text, tables, figures) render inline in the published HTML, so the docs site reflects the actual library behavior.
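For orientation, the execution setting described above lives in the Sphinx conf.py. The fragment below is a minimal sketch, not this project's actual configuration: only nb_execution_mode = "cache" comes from the text, while the timeout and error-handling values are illustrative assumptions built on standard myst-nb options.

```python
# conf.py (sketch): Sphinx settings for executing myst-nb pages at build time.
# Only nb_execution_mode = "cache" is stated in the docs; every other value
# here is an illustrative assumption.

# Enable the myst-nb extension so {code-cell} directives are executed.
extensions = ["myst_nb"]

# Reuse cached outputs and re-run a page only when its source has changed.
nb_execution_mode = "cache"

# Assumed value: per-cell timeout in seconds.
nb_execution_timeout = 120

# Assumed value: fail the build if any cell raises, so a broken example
# cannot be published silently.
nb_execution_raise_on_error = True
```

With cache mode, a clean rebuild of the docs stays fast once every page has executed at least once, because unchanged pages are served from the cache.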

Two pages have execution disabled at page level because they require optional dependencies that aren’t in [dev]:

  • pytorch_scorer_example.md needs torch (about 700 MB including transitive dependencies)

  • callable_embedder_dedup.md needs [embeddings] (sentence-transformers)

These pages render their code statically.
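The project disables execution for these pages at page level. A different, project-wide way to get a similar effect is myst-nb's nb_execution_excludepatterns option in conf.py. The sketch below is an assumption-laden illustration of that alternative, not the mechanism used here: the pages_to_skip helper and the OPTIONAL_PAGES mapping are hypothetical, and the patterns may need adjusting to the real docs layout.

```python
# conf.py fragment (sketch): skip execution for pages whose optional
# dependency is not importable. Hypothetical alternative to the page-level
# switch the project actually uses.
from importlib.util import find_spec

# Hypothetical mapping: page file name -> top-level module it needs.
# "sentence_transformers" is the import name of the sentence-transformers package.
OPTIONAL_PAGES = {
    "pytorch_scorer_example.md": "torch",
    "callable_embedder_dedup.md": "sentence_transformers",
}


def pages_to_skip(requirements):
    """Return the sorted pages whose required module cannot be found."""
    return sorted(
        page
        for page, module in requirements.items()
        if find_spec(module) is None  # None means the module is not installed
    )


# myst-nb option: glob patterns for pages that are rendered without executing.
nb_execution_excludepatterns = pages_to_skip(OPTIONAL_PAGES)
```

Under this approach a page would execute automatically in any environment where its optional dependency is installed, and fall back to a static render elsewhere.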