# Methodology

This directory is eval-toolkit's *consumer-facing* methodology curriculum. Each chapter is a self-contained guide: how to think about a methodological concern, the eval-toolkit primitive that operationalizes it, the pitfalls that make it hard to spot, and citations for the underlying canon.

> **Audience.** Hybrid expert + learner. Each chapter assumes
> sklearn / statsmodels / scipy fluency in the main prose and offers a
> *Background* admonition for the prerequisite concept; every chapter
> closes with an explicit *Pitfalls / Common mistakes* section.
> Future agents reading these docs as authoritative sources have stable
> header anchors, "what to do / what NOT to do" callouts, and runnable
> code blocks that lift cleanly into starting points.

## When to read which chapter

| Reading order | Chapter | Read when... |
|---|---|---|
| 1 | [leakage.md](leakage.md) | Designing splits, auditing a corpus, debugging "too good to be true" eval numbers, working with prompt-injection / safety / contamination-prone tasks. |
| 2 | [splits.md](splits.md) | Choosing between holdout and K-fold, working with grouped / time-series / multi-source data, making OOD claims. |
| 3 | [thresholds.md](thresholds.md) | Picking an operating point, migrating from the v0.6 string `criterion` API, fitting cost-sensitive thresholds, deciding whether to refit thresholds per bootstrap resample. |
| 4 | [calibration.md](calibration.md) | Reporting calibration error, interpreting reliability diagrams, deciding *whether* to recalibrate, working with PyTorch logits. |
| 5 | [comparison.md](comparison.md) | Computing CIs, comparing two models, reporting "we couldn't detect a difference" claims (MDE), deciding bootstrap method (BCa vs percentile). |
| 6 | [bootstrap.md](bootstrap.md) | Going deeper on the resampling theory underlying §5: BCa derivation, paired vs unpaired, two-level bootstrap, K-fold CV-CI, resample budgets. |
| 7 | [length_stratification.md](length_stratification.md) | Auditing whether a confounder (text length, time, source) inflates the headline PR-AUC; reading the `gap_flag` from `quantile_stratified_report`. |
| 8 | [text_dedup.md](text_dedup.md) | Picking a `SimilarityStrategy` (TF-IDF / embedding / MinHash-LSH / exact-hash); tuning thresholds; understanding LSH false-negative rates. |
| 9 | [versioning.md](versioning.md) | Adopting the `Versioned` Protocol on consumer Scorers so `RunManifest.versioned_objects` auto-collects per-object versions. |
| 10 | [fairness.md](fairness.md) | Auditing per-subgroup metrics, computing demographic parity / equalized odds on top of the toolkit's primitives, picking a fairness criterion. |
| 11 | [reproducibility.md](reproducibility.md) | Setting up a reproducible run, mapping to the NeurIPS Reproducibility Checklist, navigating PyTorch determinism, replaying an old result from its manifest. |
| 12 | [evidence.md](evidence.md) | Separating exploratory evidence from claim-bearing evidence with source roles, threshold transfer, and generic gates. |
| 13 | [testing.md](testing.md) | Testing your *own* evaluation code: property / reference-equivalence / golden / visual-regression patterns. |
| 14 | [reading_list.md](reading_list.md) | Citation lookup, future-work pointers, cross-link to the v0.3 research audit. |
| 15 | [claims.md](claims.md) | Defining release-time go/no-go gates over evaluation results; understanding the v0.9 `ClaimSpec` / `EvidenceGate` / `GateResult` / `ClaimReport` pipeline; writing a custom gate. |
| 16 | [artifacts.md](artifacts.md) | Persisting predictions for replay; computing bootstrap CIs / paired diffs from on-disk artifacts; understanding the v0.9 `PredictionArtifactRef` / `MetricState` contract. |

## Reading paths

- **"I'm new to eval-toolkit and want the conceptual map."** Read in
  order 1 → 16 above; ~5 hours total. Skip the *Background* admonitions
  if you're sklearn-fluent.
- **"I'm migrating prompt_injection_detector / -showcase / -sdd to
  v0.7.0."** Start with [thresholds.md](thresholds.md) §"v0.7.0 BREAKING
  migration mapping", then [leakage.md](leakage.md) §"NEW in v0.7.0"
  (encoding-obfuscated dupes), then [splits.md](splits.md)
  §"Source-disjoint K-fold".
- **"I'm building a new harness on top of eval-toolkit."** Read
  [extending.md](../extending.md) first, then loop back here for the
  individual concerns relevant to your task.
- **"I'm an agent surfacing eval-toolkit content."** Use the
  chapter-level anchors in the table above to deep-link. Each chapter
  has semantic header anchors (e.g. `leakage.md#cross-split`,
  `thresholds.md#cost-sensitive`,
  `reproducibility.md#pytorch-determinism`); those are stable and
  agent-friendly.

## Cross-references

- The [v0.3 research audit](https://github.com/brandon-behring/eval-toolkit/blob/main/docs/archive/v0.3_research_audit.md)
  is the *descriptive* counterpart: literature review, gap analysis,
  industry rating. Use it when defending a methodological choice in a
  write-up.
- [`extending.md`](../extending.md) is the build-side complement: how to
  implement Scorers, LeakageChecks, Splitters, ThresholdSelectors,
  DatasetLoaders for your own project.
- The worked examples
  ([prompt_injection_walkthrough.md](../examples/prompt_injection_walkthrough.md),
  [pytorch_scorer_example.md](../examples/pytorch_scorer_example.md))
  apply the methodology end-to-end on small synthetic fixtures, with
  cross-links to real corpora.

## Style commitments

The chapters in this directory commit to these stylistic invariants (enforced by [Sybil](https://sybil.readthedocs.io/) for code blocks and by review for prose):

- **All Python code blocks are runnable end-to-end.** Sybil executes
  every fenced `python` block; CI fails loudly on a broken example.
  PyTorch / HuggingFace blocks are marked `` with a one-line rationale.
- **Math goes in LaTeX inline / display, not pseudocode.** Pseudocode
  drifts from the API; LaTeX is canonical.
- **Every section has a stable header anchor** for deep-linking.
- **Every chapter ends with a Pitfalls section.** The "what NOT to do"
  list is canonical context for both humans and agents.
- **Citations are inline + collected in [reading_list.md](reading_list.md).**
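
The prose commitments above are enforced by review, but the "every chapter ends with a Pitfalls section" rule is mechanical enough to spot-check. The following is a minimal sketch, not part of eval-toolkit: the helper names `section_headings` and `ends_with_pitfalls` are hypothetical, and it assumes chapters use level-2 (`##`) headings for their top-level sections, which the source does not state.

```python
import re

# Hypothetical helpers, NOT part of eval-toolkit.
# Assumption: chapter sections are level-2 ("##") Markdown headings.
HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*$")
FENCE_MARKERS = ("`" * 3, "~" * 3)  # opening/closing code-fence prefixes


def section_headings(markdown: str, level: int = 2) -> list[str]:
    """Return heading text at `level`, skipping lines inside fenced code."""
    headings: list[str] = []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE_MARKERS):
            in_fence = not in_fence  # toggle on every fence line
            continue
        if in_fence:
            continue  # "## ..." inside a code block is not a heading
        match = HEADING.match(line)
        if match and len(match.group(1)) == level:
            headings.append(match.group(2))
    return headings


def ends_with_pitfalls(markdown: str) -> bool:
    """True if the last section heading mentions 'Pitfalls'."""
    heads = section_headings(markdown)
    return bool(heads) and "pitfalls" in heads[-1].lower()


chapter = (
    "# Thresholds\n\n"
    "## Cost-sensitive\n\nSome prose.\n\n"
    "## Pitfalls / Common mistakes\n\n- Do not refit on test data.\n"
)
print(ends_with_pitfalls(chapter))  # True
```

A check like this could run over each chapter file in CI next to the Sybil run; whether the index-style chapters (such as the reading list) should be exempt is a project decision this sketch does not make.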