# Extending eval-toolkit

This guide is the build-side complement to
[`docs/methodology/`](methodology/README.md). The methodology docs say
what good evaluation looks like; this guide says how to plug your code
into eval-toolkit's harness.

> **Three tiers, three entry points.** The toolkit is layered so you can
> start at the level of abstraction your task actually needs:
>
> - **Tier 1 — functional core.** Pure functions on `(y_true, y_score)`
>   arrays. No model coupling. Use this when you have predictions
>   already and just want metrics + CIs.
> - **Tier 2 — protocols.** Implement [`Scorer`](#scorer),
>   [`LeakageCheck`](#leakage-check), [`Splitter`](#splitter),
>   [`ThresholdSelector`](#threshold-selector),
>   [`DatasetLoader`](#dataset-loader),
>   [`SimilarityStrategy`](#similarity-strategy). Use these when you
>   want the harness to orchestrate.
> - **Tier 3 — reproducibility scaffolding.**
>   [`build_manifest`](api/manifest.md),
>   [`set_global_seeds`](api/seeds.md),
>   `provenance.*`. Use these regardless of which tier you build at.

## Setup (used throughout)

```python
import numpy as np
import pandas as pd
from eval_toolkit import EvalSlice
```

(functional-core)=
## Tier 1 — functional core (no Protocol needed)
When you already have predictions, just call the metrics directly:

```python
from eval_toolkit import pr_auc, roc_auc, bootstrap_ci, paired_bootstrap_diff

rng = np.random.default_rng(42)
y = rng.binomial(1, 0.3, size=200)
s = np.clip(0.6 * y + rng.normal(0, 0.25, size=200), 0, 1)

ci = bootstrap_ci(y, s, pr_auc, n_resamples=500, seed=42)
print(f"PR-AUC: {ci.point_estimate:.3f}  CI [{ci.ci_low:.3f}, {ci.ci_high:.3f}]")
```

This is the fastest path — no Protocol implementation, no harness
orchestration, no manifest. Useful for ad-hoc analysis and notebooks.
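For intuition about what a call like `bootstrap_ci` computes, here is a minimal percentile-bootstrap sketch in plain numpy. It is not the toolkit's implementation: `average_precision_naive` and `percentile_bootstrap_ci` are hypothetical names, the metric ignores tied scores, and the toolkit may use a different interval method or resampling scheme.

```python
import numpy as np


def average_precision_naive(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """Tie-naive average precision: mean of precision@k over the positives."""
    order = np.argsort(-y_score, kind="stable")
    y_sorted = y_true[order]
    precision_at_k = np.cumsum(y_sorted) / np.arange(1, len(y_sorted) + 1)
    return float((precision_at_k * y_sorted).sum() / y_sorted.sum())


def percentile_bootstrap_ci(y_true, y_score, metric, n_resamples=500,
                            alpha=0.05, seed=0):
    """Hypothetical helper: resample rows with replacement, take quantiles."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)
        if y_true[idx].sum() == 0:
            continue  # metric is undefined without positives; skip the draw
        stats.append(metric(y_true[idx], y_score[idx]))
    low, high = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return metric(y_true, y_score), float(low), float(high)


rng = np.random.default_rng(42)
y = rng.binomial(1, 0.3, size=200)
s = 0.6 * y + rng.normal(0, 0.25, size=200)
point, low, high = percentile_bootstrap_ci(y, s, average_precision_naive, seed=42)
print(f"AP: {point:.3f}  CI [{low:.3f}, {high:.3f}]")
```

Passing an explicit seed makes the interval reproducible, which is the same reason the toolkit call above takes `seed=42`.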

(scorer)=
## Implementing a `Scorer`
The [`Scorer` Protocol](api/harness.md) is anything
exposing `predict_proba(X) -> np.ndarray of P(positive)`.

### sklearn classifier

sklearn's `LogisticRegression`, `RandomForestClassifier`, etc. already
expose `predict_proba`, but it returns an `(n_samples, n_classes)`
array. Wrap the estimator so `predict_proba` returns only the
positive-class column:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

class TfidfLogisticScorer:
    """sklearn pipeline as an eval_toolkit.Scorer."""
    version = "0.1.0"  # captured into RunManifest.versioned_objects

    def __init__(self) -> None:
        self.pipe = Pipeline([
            ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
            ("lr", LogisticRegression(max_iter=200, random_state=42)),
        ])

    def fit(self, X: list[str], y: np.ndarray) -> None:
        self.pipe.fit(X, y)

    def predict_proba(self, X: list[str]) -> np.ndarray:
        return self.pipe.predict_proba(X)[:, 1]


# Demo on tiny synthetic data:
scorer = TfidfLogisticScorer()
texts = [f"good text {i}" for i in range(50)] + [f"bad attack {i}" for i in range(50)]
labels = np.array([0] * 50 + [1] * 50)
scorer.fit(texts, labels)
preds = scorer.predict_proba(texts[:5])
print(f"first 5 scores: {preds.round(3)}")
```

Note the `version` attribute — implementing the
[`Versioned`](api/leakage.md) opt-in Protocol means
`build_manifest(versioned={...})` auto-captures it, so cross-version
metric comparisons can be invalidated. See
[`methodology/versioning.md`](methodology/versioning.md) for the full
story (when to expose `version`, how to choose a version string, the
lm-evaluation-harness pattern this mirrors).
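The opt-in idea can be sketched with a runtime-checkable Protocol from the standard library. This is an illustration only: `HasVersion` and `collect_versions` are hypothetical names, and the toolkit's actual `Versioned` definition and capture logic may differ.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class HasVersion(Protocol):
    """Hypothetical stand-in for an opt-in 'has a version string' Protocol."""
    version: str


def collect_versions(objects: dict[str, object]) -> dict[str, str]:
    """Record versions only for objects that opted in by exposing `version`."""
    return {
        name: obj.version
        for name, obj in objects.items()
        if isinstance(obj, HasVersion)
    }


class VersionedScorer:
    version = "0.1.0"


class UnversionedScorer:
    pass


captured = collect_versions({"a": VersionedScorer(), "b": UnversionedScorer()})
print(captured)
```

Objects without a `version` attribute are simply left out, so opting in stays a per-class decision.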

### LLM-judge with cost control

The [`SliceAwareScorer`](api/harness.md) Protocol's
`should_score_slice(name)` hook lets the harness skip slices the
scorer doesn't need to score — critical for expensive LLM judges:

```python
class _LLMJudgeStub:
    """Pretend LLM-judge that runs only on the headline slice."""
    version = "claude-haiku-2026-q1"

    def predict_proba(self, X):
        # In production: a batched LLM call returning P(injection).
        return np.full(len(X), 0.5)

    def should_score_slice(self, slice_name: str) -> bool:
        # Cost-control: don't burn budget on subgroup / OOD slices.
        return slice_name == "test"


judge = _LLMJudgeStub()
print(f"score 'test' slice? {judge.should_score_slice('test')}")
print(f"score 'ood_lakera' slice? {judge.should_score_slice('ood_lakera')}")
```

`evaluate(..., scorers={'judge': judge})` calls `should_score_slice`
before scoring; skipped slices are recorded as
`RunResult.by_slice[name].by_scorer[scorer_name] = {"skipped": "<reason>"}`.
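How a harness might honor the hook can be sketched in a few lines. This is a simplified stand-in, not `evaluate` itself: `score_slices`, the result layout, and the skip-reason string are assumptions made for illustration.

```python
def score_slices(scorers: dict, slices: dict) -> dict:
    """Hypothetical loop: consult should_score_slice, if present, before scoring."""
    results: dict = {}
    for slice_name, X in slices.items():
        results[slice_name] = {}
        for scorer_name, scorer in scorers.items():
            hook = getattr(scorer, "should_score_slice", None)
            if hook is not None and not hook(slice_name):
                # Record the skip instead of silently omitting the entry.
                results[slice_name][scorer_name] = {"skipped": "should_score_slice=False"}
                continue
            results[slice_name][scorer_name] = {"scores": scorer.predict_proba(X)}
    return results


class ExpensiveJudge:
    def predict_proba(self, X):
        return [0.5] * len(X)

    def should_score_slice(self, slice_name: str) -> bool:
        return slice_name == "test"


class CheapScorer:
    # No hook: this scorer is run on every slice.
    def predict_proba(self, X):
        return [0.1] * len(X)


out = score_slices(
    {"judge": ExpensiveJudge(), "cheap": CheapScorer()},
    {"test": ["a", "b"], "ood_lakera": ["c"]},
)
print(out["ood_lakera"]["judge"])
```

Treating the hook as optional (via `getattr`) keeps plain scorers working unchanged while letting expensive ones opt out per slice.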

### PyTorch + transformer + LoRA scorer

See [pytorch_scorer_example.md](examples/pytorch_scorer_example.md) for
the worked example. The shape is the same — wrap an `nn.Module` so its
forward+softmax returns a numpy array.
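The softmax-and-slice step can be sketched without PyTorch by treating the model's forward pass as an opaque callable that returns logits. Everything here is an assumption for illustration: `LogitsScorer`, `forward_fn`, and `fake_forward` are hypothetical names. In a real PyTorch adapter the callable would typically tokenize the batch, run the module under `torch.no_grad()`, and convert the logits with `.cpu().numpy()`.

```python
from collections.abc import Callable, Sequence

import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


class LogitsScorer:
    """Hypothetical adapter: batched forward pass -> P(positive) as a 1-D array."""
    version = "0.0.1"

    def __init__(self, forward_fn: Callable[[Sequence[str]], np.ndarray],
                 batch_size: int = 32) -> None:
        self.forward_fn = forward_fn
        self.batch_size = batch_size

    def predict_proba(self, X: Sequence[str]) -> np.ndarray:
        chunks = []
        for start in range(0, len(X), self.batch_size):
            batch = X[start:start + self.batch_size]
            logits = np.asarray(self.forward_fn(batch), dtype=float)
            chunks.append(softmax(logits)[:, 1])  # keep the positive class only
        return np.concatenate(chunks) if chunks else np.empty(0)


def fake_forward(batch: Sequence[str]) -> np.ndarray:
    # Stand-in for a model: two logits per text, [negative, positive].
    return np.array([[0.0, 2.0] if "attack" in t else [2.0, 0.0] for t in batch])


scores = LogitsScorer(fake_forward, batch_size=2).predict_proba(
    ["hello", "attack now", "benign"]
)
print(scores.round(3))
```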

(leakage-check)=
## Implementing a `LeakageCheck`
[`LeakageCheck`](api/leakage.md) takes
`Mapping[str, EvalSlice]` and returns a `LeakageFinding`. The uniform
input shape means within-split and cross-split checks share one
contract.

```python
from dataclasses import dataclass
from collections.abc import Mapping
from eval_toolkit import LeakageFinding

@dataclass(frozen=True, slots=True)
class TextLengthOutlierCheck:
    """A toy LeakageCheck: flag rows whose text is >5x the median length."""
    severity: str = "warning"

    @property
    def name(self) -> str:
        return "TextLengthOutlierCheck"

    def validate(self, splits: Mapping[str, EvalSlice]) -> LeakageFinding:
        drop: dict[str, list[int]] = {}
        n_affected = 0
        for split_name, slice_ in splits.items():
            lengths = np.array([len(t) for t in slice_.features])
            if len(lengths) == 0:
                continue
            median = float(np.median(lengths))
            mask = lengths > 5 * max(median, 1)
            if mask.any():
                drop[split_name] = sorted(np.where(mask)[0].tolist())
                n_affected += int(mask.sum())
        return LeakageFinding(
            check_name=self.name,
            severity=self.severity,  # type: ignore[arg-type]
            drop_indices=drop,
            evidence={"rule": "len(text) > 5 * median"},
            message=(f"{n_affected} length-outlier rows" if n_affected
                     else "no length outliers"),
            n_affected=n_affected,
        )

# Demo:
df = pd.DataFrame({"text": ["short"] * 10 + ["x" * 10000], "label": [0] * 10 + [1]})
splits = {"test": EvalSlice(name="test", df=df)}
finding = TextLengthOutlierCheck().validate(splits)
print(f"{finding.check_name}: {finding.message}")
```

The toolkit's reference impls in `leakage.py` are the canonical
patterns to mirror — see
[methodology/leakage.md](methodology/leakage.md) for which check goes
with which problem.
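Once a finding comes back, the flagged rows still have to be removed by something. A minimal sketch of that step follows, assuming `drop_indices` maps each split name to positional row indices, as in the toy check above. `apply_drops` is a hypothetical helper, not a toolkit function, and it works on plain DataFrames rather than `EvalSlice` objects.

```python
import numpy as np
import pandas as pd


def apply_drops(frames: dict[str, pd.DataFrame],
                drop_indices: dict[str, list[int]]) -> dict[str, pd.DataFrame]:
    """Hypothetical helper: drop flagged row positions from each split."""
    cleaned = {}
    for name, df in frames.items():
        keep = np.ones(len(df), dtype=bool)
        positions = np.asarray(drop_indices.get(name, []), dtype=int)
        keep[positions] = False
        cleaned[name] = df.loc[keep].reset_index(drop=True)
    return cleaned


frames = {
    "train": pd.DataFrame({"text": ["a", "b"], "label": [0, 1]}),
    "test": pd.DataFrame({"text": ["short"] * 10 + ["x" * 10000],
                          "label": [0] * 10 + [1]}),
}
cleaned = apply_drops(frames, {"test": [10]})
print({name: len(df) for name, df in cleaned.items()})
```

Splits that do not appear in `drop_indices` pass through unchanged, which matches a check that only reports the splits it found problems in.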

(splitter)=
## Implementing a `Splitter`
[`Splitter`](api/splits.md) yields fold-dicts ready for
`evaluate(...)`. The simplest implementation wraps an existing sklearn
splitter:

```python
from collections.abc import Iterator
from dataclasses import dataclass
from sklearn.model_selection import StratifiedShuffleSplit

@dataclass(frozen=True, slots=True)
class StratifiedShuffleSplitter:
    """A small Splitter wrapping sklearn.StratifiedShuffleSplit."""
    n_splits: int = 5
    test_size: float = 0.2
    seed: int = 42

    def iter_folds(
        self, slice_, *, groups=None
    ) -> Iterator[dict[str, EvalSlice]]:
        sss = StratifiedShuffleSplit(
            n_splits=self.n_splits, test_size=self.test_size,
            random_state=self.seed,
        )
        y = slice_.y_true
        x_dummy = np.arange(len(y)).reshape(-1, 1)
        for train_idx, test_idx in sss.split(x_dummy, y):
            yield {
                "train": EvalSlice(
                    name="train", df=slice_.df.iloc[train_idx].reset_index(drop=True),
                    feature_col=slice_.feature_col, label_col=slice_.label_col,
                    strata_col=slice_.strata_col,
                ),
                "test": EvalSlice(
                    name="test", df=slice_.df.iloc[test_idx].reset_index(drop=True),
                    feature_col=slice_.feature_col, label_col=slice_.label_col,
                    strata_col=slice_.strata_col,
                ),
            }

    def get_n_splits(self, slice_) -> int:
        return self.n_splits


# Demo:
df = pd.DataFrame({"text": [f"r{i}" for i in range(40)],
                   "label": [i % 2 for i in range(40)]})
parent = EvalSlice(name="all", df=df)
spl = StratifiedShuffleSplitter(n_splits=3, test_size=0.25)
for i, fold in enumerate(spl.iter_folds(parent)):
    print(f"  fold {i}: train={len(fold['train'].df)} test={len(fold['test'].df)}")
```

(threshold-selector)=
## Implementing a `ThresholdSelector`
[`ThresholdSelector`](api/thresholds.md) returns a
[`ThresholdResult`](api/metrics.md). A custom selector
is one short class:

```python
from dataclasses import dataclass
from eval_toolkit import ThresholdResult, metrics_at_threshold

@dataclass(frozen=True, slots=True)
class FixedThresholdSelector:
    """Always returns a caller-supplied threshold."""
    threshold: float

    @property
    def criterion(self) -> str:
        return f"fixed_{self.threshold:.3f}"

    def select(self, y_true, y_score) -> ThresholdResult:
        m = metrics_at_threshold(y_true, y_score, self.threshold)
        return ThresholdResult(
            threshold=float(self.threshold), f1=float(m["f1"]),
            precision=float(m["precision"]), recall=float(m["recall"]),
            criterion=self.criterion,
        )

# Demo:
y = np.array([0, 0, 1, 1, 0, 1])
s = np.array([0.1, 0.2, 0.7, 0.9, 0.3, 0.8])
result = FixedThresholdSelector(0.5).select(y, s)
print(f"fixed 0.5: F1={result.f1:.3f}  P={result.precision:.3f}")
```

When threshold-selection variance matters, pair with
`paired_bootstrap_op_point_diff` — see
[methodology/thresholds.md §"When to refit threshold per resample"
](methodology/thresholds.md#bootstrap-refit).
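To see why selection variance can matter, the sketch below refits a max-F1 threshold on each bootstrap resample and reports how much the chosen threshold moves. It is a plain-numpy illustration, not `paired_bootstrap_op_point_diff`: `max_f1_threshold` and `bootstrap_thresholds` are hypothetical names, and the toolkit's own selectors may break ties or scan candidates differently.

```python
import numpy as np


def max_f1_threshold(y_true: np.ndarray, y_score: np.ndarray) -> tuple[float, float]:
    """Scan observed scores as candidate thresholds; return (threshold, F1)."""
    best_t, best_f1 = 0.5, -1.0
    for t in np.unique(y_score):
        pred = y_score >= t
        tp = int(np.sum(pred & (y_true == 1)))
        fp = int(np.sum(pred & (y_true == 0)))
        fn = int(np.sum(~pred & (y_true == 1)))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:  # strict '>' keeps the lowest threshold on ties
            best_t, best_f1 = float(t), float(f1)
    return best_t, best_f1


def bootstrap_thresholds(y_true, y_score, n_resamples=200, seed=0) -> np.ndarray:
    """Refit the threshold on each resample to expose selection variance."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    picked = []
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)
        t, _f1 = max_f1_threshold(y_true[idx], y_score[idx])
        picked.append(t)
    return np.array(picked)


y = np.array([0, 0, 1, 1, 0, 1])
s = np.array([0.1, 0.2, 0.7, 0.9, 0.3, 0.8])
t_star, f1_star = max_f1_threshold(y, s)
ts = bootstrap_thresholds(y, s, n_resamples=200, seed=42)
print(f"full-sample threshold={t_star}  F1={f1_star:.3f}  resample std={ts.std():.3f}")
```

A fixed threshold, as in `FixedThresholdSelector`, has zero selection variance by construction; a data-dependent one does not, which is the case the linked methodology section addresses.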

(dataset-loader)=
## Implementing a `DatasetLoader`
[`DatasetLoader`](api/loaders.md) returns
`dict[str, EvalSlice]` (HF `DatasetDict` shape) plus a
Croissant-compatible `describe()`. Tensor-agnostic: torch users tokenize
inside the `Scorer`, not the loader.

```python
from dataclasses import dataclass

@dataclass(frozen=True, slots=True)
class TwoListLoader:
    """Loader for the simplest case: two pre-split lists of texts/labels."""
    train_texts: list[str]
    train_labels: list[int]
    test_texts: list[str]
    test_labels: list[int]
    name: str = ""

    def load_splits(self) -> dict[str, EvalSlice]:
        return {
            "train": EvalSlice(
                name="train",
                df=pd.DataFrame({"text": self.train_texts, "label": self.train_labels}),
            ),
            "test": EvalSlice(
                name="test",
                df=pd.DataFrame({"text": self.test_texts, "label": self.test_labels}),
            ),
        }

    def describe(self) -> dict[str, object]:
        return {
            "name": self.name or "TwoListLoader",
            "description": "",
            "citeAs": "",
            "license": "",
            "url": "",
            "distribution": [],
            "n_train": len(self.train_texts),
            "n_test": len(self.test_texts),
        }


# Demo:
loader = TwoListLoader(
    train_texts=["a", "b"], train_labels=[0, 1],
    test_texts=["c"], test_labels=[1], name="demo",
)
splits = loader.load_splits()
print(f"keys={list(splits.keys())}  describe.name={loader.describe()['name']}")
```

(similarity-strategy)=
## Implementing a `SimilarityStrategy`
This Protocol predates v0.7.0; see the
[`text_dedup.py`](api/text_dedup.md) docstring and the existing
reference impls (`TfidfCosineStrategy`, `ExactNormalizedHashStrategy`,
`EmbeddingCosineStrategy`, `JaccardNgramStrategy`,
`MinHashLSHStrategy`). The shape is
`pairs_within(texts, k_neighbors)` → similarity / index arrays. A
strategy is the pluggable backend for `near_dedup` and `cross_dedup`
and, transitively, for `NearDuplicateCheck` and
`CrossSplitLeakageCheck`.
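As a rough illustration of the shape, here is a brute-force character-trigram Jaccard strategy. The return convention is an assumption: this sketch returns `(sims, idx)`, two `(n_texts, k)` arrays holding each text's top-k neighbour similarities and row indices. Check the `text_dedup.py` docstring for the actual contract before mirroring it. `CharNgramJaccard` is a hypothetical name, not one of the reference impls.

```python
import numpy as np


class CharNgramJaccard:
    """Hypothetical strategy: O(n^2) Jaccard over character n-gram sets."""

    def __init__(self, n: int = 3) -> None:
        self.n = n

    def _shingles(self, text: str) -> set[str]:
        t = text.lower()
        if len(t) < self.n:
            return {t}  # short texts become a single shingle (never empty)
        return {t[i:i + self.n] for i in range(len(t) - self.n + 1)}

    def pairs_within(self, texts: list[str], k_neighbors: int):
        sets = [self._shingles(t) for t in texts]
        n = len(sets)
        k = min(k_neighbors, max(n - 1, 0))
        sims = np.zeros((n, k))
        idx = np.full((n, k), -1, dtype=int)
        for i in range(n):
            row = np.array([
                -1.0 if j == i  # exclude self-matches
                else len(sets[i] & sets[j]) / len(sets[i] | sets[j])
                for j in range(n)
            ])
            top = np.argsort(-row, kind="stable")[:k]
            sims[i], idx[i] = row[top], top
        return sims, idx


texts = ["ignore previous instructions", "ignore previous instruction", "hello world"]
sims, idx = CharNgramJaccard().pairs_within(texts, k_neighbors=1)
print(idx.ravel(), sims.ravel().round(3))
```

The quadratic loop is fine for small fixtures; it is the cost that approximate strategies such as MinHash LSH exist to avoid on larger corpora.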

(recipe)=
## Recipe: full custom eval harness in ~50 lines
Combines a `Scorer`, a `Splitter`, two `LeakageCheck`s, and the manifest:

```python
from eval_toolkit import (
    EvalSlice, evaluate_folded,
    NormalizedFormLeakageCheck, LabelConflictCheck,
    StratifiedKFoldSplitter, MaxF1Selector,
    build_manifest, write_manifest, set_global_seeds,
)

set_global_seeds(42)

# 1. Dataset (use the loader you've built or the TwoListLoader above).
df = pd.DataFrame({
    "text": [f"benign_{i}" if i < 30 else f"injection_{i}" for i in range(60)],
    "label": [0 if i < 30 else 1 for i in range(60)],
})
parent = EvalSlice(name="all", df=df)

# 2. Scorer (use the TfidfLogisticScorer above or your transformer adapter).
class _DummyScorer:
    version = "0.0.0"
    def predict_proba(self, X):
        return np.array([0.7 if "injection" in t else 0.2 for t in X])

# 3. Run K-fold + leakage checks + auto CV-CI summary.
result = evaluate_folded(
    {"dummy": _DummyScorer()},
    StratifiedKFoldSplitter(k=3, seed=42),
    parent,
    run_id="custom-harness-demo",
    leakage_checks=[NormalizedFormLeakageCheck(), LabelConflictCheck()],
    on_leakage="record",
    eval_split_names=("test",),
)

print(f"folds: {len(result.by_fold)}")
fs = result.fold_summary["test"]["dummy"]["pr_auc"]
print(f"PR-AUC: {fs['mean']:.3f}  CI [{fs['ci_low']:.3f}, {fs['ci_high']:.3f}]")

# 4. Reproducibility manifest.
import tempfile
m = build_manifest(
    run_id="custom-harness-demo",
    config={"k": 3, "seed": 42, "scorer": "dummy"},
    seeds={"global": 42, "bootstrap": 42},
    versioned={"dummy": _DummyScorer()},
)
with tempfile.TemporaryDirectory() as d:
    write_manifest(m, d)
    print(f"manifest captured: schema={m.schema_version}")
```

That's the full pipeline: dataset loading, leakage validation,
splitting, scoring, CV-CI aggregation, manifest emission. Every step
above is replaceable with a custom Protocol implementation.

(project-layout)=
## Project layout for downstream consumers
The [`prompt_injection_classifier_showcase`
](https://github.com/brandon-behring/prompt_injection_classifier_showcase)
repo is the canonical worked example. Mirror its layout:

```
your_project/
  src/your_project/
    scorers.py       # Scorer implementations
    data.py          # DatasetLoader (only if not using built-in loaders)
    evaluate.py      # thin script: load → check → split → score → manifest
  tests/
    test_scorers.py            # smoke + reference-equivalence
    test_evaluate_smoke.py     # end-to-end on a tiny fixture
  evals/
    run_<timestamp>/
      results.json
      results_full.json
      manifest.json
  pyproject.toml     # pin eval-toolkit>=0.7.0,<0.8
```

The harness owns the orchestration; your project owns scorer
implementations, data loading, and the (small) script that wires them
together.
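A smoke test in `tests/test_scorers.py` could start from a contract check like the one below. This is a sketch under assumptions: `check_scorer_contract` is a hypothetical helper, and the checks (1-D output, finite values, range `[0, 1]`) are inferred from the `predict_proba(X) -> P(positive)` description above rather than taken from the toolkit's own tests.

```python
import numpy as np


def check_scorer_contract(scorer, X) -> None:
    """Hypothetical smoke check: one finite probability per input row."""
    scores = np.asarray(scorer.predict_proba(X))
    assert scores.shape == (len(X),), f"expected 1-D output, got {scores.shape}"
    assert np.all(np.isfinite(scores)), "scores contain NaN or inf"
    assert np.all((scores >= 0.0) & (scores <= 1.0)), "scores outside [0, 1]"


class KeywordScorer:
    version = "0.0.0"

    def predict_proba(self, X):
        return np.array([0.7 if "injection" in t else 0.2 for t in X])


class TwoColumnScorer:
    # Common mistake: returning the raw (n, 2) probability matrix.
    def predict_proba(self, X):
        return np.tile([0.3, 0.7], (len(X), 1))


def test_keyword_scorer_smoke():
    check_scorer_contract(KeywordScorer(), ["benign_1", "injection_2"])


test_keyword_scorer_smoke()
```

The `TwoColumnScorer` case is the failure this check is meant to catch: forgetting to slice out the positive-class column.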

## Further reading

- [`docs/methodology/`](methodology/README.md) — concept-by-concept
  guide for what the Protocols actually operationalize.
- [`docs/examples/prompt_injection_walkthrough.md`](examples/prompt_injection_walkthrough.md)
  — end-to-end PI workflow on a synthetic fixture.
- [`docs/examples/pytorch_scorer_example.md`](examples/pytorch_scorer_example.md)
  — Transformer + LoRA Scorer adapter.
- The toolkit's own `tests/` — smoke tests for every reference impl,
  pattern templates for property tests in PR 1.5.