Methodology

This directory is eval-toolkit’s consumer-facing methodology curriculum. Each chapter is a self-contained guide: how to think about a methodological concern, the eval-toolkit primitive that operationalizes it, the pitfalls that make problems in that area hard to spot, and citations for the underlying canon.

Audience. A hybrid of experts and learners. The main prose of each chapter assumes sklearn / statsmodels / scipy fluency, and a Background admonition covers the prerequisite concept; every chapter closes with an explicit Pitfalls / Common mistakes section. Agents that treat these docs as an authoritative source get stable header anchors, “what to do / what NOT to do” callouts, and runnable code blocks that lift cleanly into starting points.

When to read which chapter

Reading order	Chapter	Read when…
1	leakage.md	Designing splits, auditing a corpus, debugging “too good to be true” eval numbers, working with prompt-injection / safety / contamination-prone tasks.
2	splits.md	Choosing between holdout and K-fold, working with grouped / time-series / multi-source data, making OOD claims.
3	thresholds.md	Picking an operating point, migrating from the v0.6 string `criterion` API, fitting cost-sensitive thresholds, deciding whether to refit thresholds per bootstrap resample.
4	calibration.md	Reporting calibration error, interpreting reliability diagrams, deciding whether to recalibrate, working with PyTorch logits.
5	comparison.md	Computing CIs, comparing two models, reporting “we couldn’t detect a difference” claims (MDE), choosing a bootstrap method (BCa vs percentile).
6	bootstrap.md	Going deeper on the resampling theory underlying chapter 5 (comparison.md): BCa derivation, paired vs unpaired, two-level bootstrap, K-fold CV-CI, resample budgets.
7	length_stratification.md	Auditing whether a confounder (text length, time, source) inflates the headline PR-AUC; reading the `gap_flag` from `quantile_stratified_report`.
8	text_dedup.md	Picking a `SimilarityStrategy` (TF-IDF / embedding / MinHash-LSH / exact-hash); tuning thresholds; understanding LSH false-negative rates.
9	versioning.md	Adopting the `Versioned` Protocol on consumer Scorers so `RunManifest.versioned_objects` auto-collects per-object versions.
10	fairness.md	Auditing per-subgroup metrics, computing demographic parity / equalized odds on top of the toolkit’s primitives, picking a fairness criterion.
11	reproducibility.md	Setting up a reproducible run, mapping to the NeurIPS Reproducibility Checklist, navigating PyTorch determinism, replaying an old result from its manifest.
12	evidence.md	Separating exploratory evidence from claim-bearing evidence with source roles, threshold transfer, and generic gates.
13	testing.md	Testing your own evaluation code — property / reference-equivalence / golden / visual-regression patterns.
14	reading_list.md	Citation lookup, future-work pointers, cross-link to the v0.3 research audit.
15	claims.md	Defining release-time go/no-go gates over evaluation results; understanding the v0.9 `ClaimSpec` / `EvidenceGate` / `GateResult` / `ClaimReport` pipeline; writing a custom gate.
16	artifacts.md	Persisting predictions for replay; computing bootstrap CIs / paired diffs from on-disk artifacts; understanding the v0.9 `PredictionArtifactRef` / `MetricState` contract.

Reading paths

“I’m new to eval-toolkit and want the conceptual map.” Read in order 1 → 16 above; ~5 hours total. Skip the Background admonitions if you’re sklearn-fluent.
“I’m migrating prompt_injection_detector / -showcase / -sdd to v0.7.0.” Start with thresholds.md §“v0.7.0 BREAKING migration mapping”, then leakage.md §“NEW in v0.7.0” (encoding-obfuscated dupes), then splits.md §“Source-disjoint K-fold”.
“I’m building a new harness on top of eval-toolkit.” Read extending.md first, then loop back here for the individual concerns relevant to your task.
“I’m an agent surfacing eval-toolkit content.” Use the chapter-level anchors in the table above to deep-link. Each chapter has semantic header anchors (e.g. leakage.md#cross-split, thresholds.md#cost-sensitive, reproducibility.md#pytorch-determinism); those are stable and agent-friendly.
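For the agent path, a deep link is just a chapter file name joined to one of its header anchors. The following is a minimal sketch, not part of eval-toolkit: `DOCS_BASE` is a placeholder URL, and `KNOWN_ANCHORS` and `deep_link` are hypothetical names that list only the three example anchors quoted above.

```python
from urllib.parse import urljoin

# Hypothetical base URL; substitute wherever the methodology docs are hosted.
DOCS_BASE = "https://example.invalid/methodology/"

# Only the example anchors named in this page; a real index would be
# generated from the chapter headers rather than written by hand.
KNOWN_ANCHORS = {
    "leakage.md": {"cross-split"},
    "thresholds.md": {"cost-sensitive"},
    "reproducibility.md": {"pytorch-determinism"},
}


def deep_link(chapter, anchor, base=DOCS_BASE):
    """Return a URL for chapter#anchor, refusing anchors we do not know about."""
    if anchor not in KNOWN_ANCHORS.get(chapter, set()):
        raise KeyError(f"unknown anchor {chapter}#{anchor}")
    return urljoin(base, f"{chapter}#{anchor}")


link = deep_link("leakage.md", "cross-split")
```

Rejecting unknown anchors up front means a renamed header surfaces as an error at link-building time instead of as a silently dead link.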

Cross-references

The v0.3 research audit is the descriptive counterpart: literature review, gap analysis, industry rating. Use it when defending a methodological choice in a write-up.
extending.md is the build-side complement: how to implement Scorers, LeakageChecks, Splitters, ThresholdSelectors, DatasetLoaders for your own project.
The worked examples (prompt_injection_walkthrough.md, pytorch_scorer_example.md) apply the methodology end-to-end on small synthetic fixtures, with cross-links to real corpora.

Style commitments

The chapters in this directory commit to these stylistic invariants (enforced by Sybil for code blocks and by review for prose):

All Python code blocks are runnable end-to-end. Sybil executes every fenced python block; CI fails loudly on a broken example. PyTorch / HuggingFace blocks are the exception: each is marked to be skipped, with a one-line rationale.
Math goes in LaTeX inline / display, not pseudocode. Pseudocode drifts from the API; LaTeX is canonical.
Every section has a stable header anchor for deep-linking.
Every chapter ends with a Pitfalls section — the “what NOT to do” list is canonical context for both humans and agents.
Citations appear inline and are collected in reading_list.md.
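To make the "runnable end-to-end" commitment concrete, here is a standard-library stand-in for the kind of check Sybil performs. It is a sketch of the idea only and does not use Sybil's API: it pulls every fenced python block out of a Markdown string and executes the blocks in order in one shared namespace, so a broken example raises. The `SKIP_MARKER` comment and the function names are hypothetical, chosen for illustration; the actual skip marker used in the chapters is not shown on this page.

```python
import re

FENCE = "`" * 3  # built at runtime so this example can itself live inside a fence

# Match a fenced python block: opening fence line, body, closing fence line.
BLOCK_RE = re.compile(
    rf"^{FENCE}python[ \t]*\n(.*?)^{FENCE}[ \t]*$",
    re.DOTALL | re.MULTILINE,
)

# Hypothetical marker: a block whose first line starts with this is skipped,
# and the rest of that line is recorded as the rationale.
SKIP_MARKER = "# skip-example:"


def extract_python_blocks(markdown_text):
    """Return the source of every fenced python block, in document order."""
    return [m.group(1) for m in BLOCK_RE.finditer(markdown_text)]


def run_blocks(markdown_text):
    """Execute non-skipped blocks in a shared namespace.

    Returns (number of blocks run, skip rationales, final namespace).
    Any exception from a block propagates, which is what makes CI fail.
    """
    namespace = {}
    n_run = 0
    skipped = []
    for source in extract_python_blocks(markdown_text):
        first_line = source.lstrip().splitlines()[0] if source.strip() else ""
        if first_line.startswith(SKIP_MARKER):
            skipped.append(first_line[len(SKIP_MARKER):].strip())
            continue
        exec(compile(source, "<doc-block>", "exec"), namespace)
        n_run += 1
    return n_run, skipped, namespace


sample = "\n".join([
    "Intro prose.",
    FENCE + "python",
    "x = 2 + 3",
    FENCE,
    FENCE + "python",
    "# skip-example: needs a GPU",
    "import torch",
    FENCE,
    FENCE + "python",
    "assert x == 5",
    "y = x * 2",
    FENCE,
])
n_run, skipped, ns = run_blocks(sample)
```

Sharing one namespace across blocks lets a later example reuse names defined earlier in the same chapter, at the cost of making the blocks order-dependent.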