---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
kernelspec:
  display_name: Python 3
  language: python
  name: python3
---

# Worked example: slice-aware `evaluate` harness

> **What this shows.** Run two scorers across two slices via
> `evaluate(...)`; persist the result with `write_run_result(...)`; load
> the JSON back; verify it conforms to the `results.v1.json` schema.
>
> **Runtime:** ~2 s. Requires `pandas` for `EvalSlice`'s DataFrame
> wrapper — install via `pip install 'eval-toolkit[dataframe]'`.

## Setup

```{code-cell}
import json
import numpy as np
import pandas as pd
from pathlib import Path
from tempfile import TemporaryDirectory
from eval_toolkit import (
    EvalSlice, evaluate, write_run_result, set_global_seeds,
)
from eval_toolkit.artifacts import validate_results
set_global_seeds(42)
```

## Build two slices

We build a "validation" slice (in-distribution) and an "ood" slice
(out-of-distribution, intended to carry less signal). The harness scores
each slice independently:

```{code-cell}
rng = np.random.default_rng(42)

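def _make_slice(name: str, n: int, signal: float) -> EvalSlice:
    """Synthetic slice with balanced, shuffled labels.

    Each label is appended to the row's text so the stub scorers below can
    recover it. `signal` only documents the intended difficulty; the score
    signal itself is set on the stubs.
    """
    y = np.concatenate([np.zeros(n // 2), np.ones(n - n // 2)]).astype(int)
    rng.shuffle(y)
    df = pd.DataFrame({
        # The trailing label (not the row index) is what the stubs parse.
        "text": [f"{name}_row_{i}_{label}" for i, label in enumerate(y)],
        "label": y,
    })
    return EvalSlice(name=name, df=df)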

val_slice = _make_slice("validation", n=100, signal=0.4)
ood_slice = _make_slice("ood", n=80, signal=0.2)
print(f"slices: {val_slice.name} (n={len(val_slice.df)}), {ood_slice.name} (n={len(ood_slice.df)})")
```

## Define two `Scorer` implementations

Any object with `predict_proba(X) -> np.ndarray` satisfies the
`Scorer` Protocol. Toolkit consumers wire their real models here
(sklearn estimators, PyTorch transformers, LLM judges); for this example
we use two minimal stubs:

```{code-cell}
class _DiscriminativeStub:
    """Returns scores correlated with label + Gaussian noise."""
    def __init__(self, signal: float, noise: float, seed: int) -> None:
        self._signal = signal
        self._noise = noise
        self._rng = np.random.default_rng(seed)

    def predict_proba(self, X: list[str]) -> np.ndarray:
        n = len(X)
        # Recover label from the synthetic text suffix
        labels = np.array([int(x.split("_")[-1]) % 2 for x in X])
        return np.clip(
            0.5 + self._signal * (labels - 0.5) + self._rng.normal(0, self._noise, size=n),
            0.0, 1.0,
        )

baseline = _DiscriminativeStub(signal=0.3, noise=0.2, seed=42)
challenger = _DiscriminativeStub(signal=0.4, noise=0.15, seed=43)
```
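
For readers new to structural typing, the snippet below sketches how a Protocol that only requires `predict_proba` can be declared and checked. `ScorerLike`, `ConstantScorer`, and `NotAScorer` are hypothetical names used for illustration; the toolkit's actual `Scorer` definition may differ, and the sketch returns a plain list rather than `np.ndarray` to stay dependency-free.

```python
from typing import Protocol, Sequence, runtime_checkable


@runtime_checkable
class ScorerLike(Protocol):
    """Hypothetical stand-in for the toolkit's Scorer Protocol."""

    def predict_proba(self, X: Sequence[str]) -> Sequence[float]: ...


class ConstantScorer:
    """Does not inherit from ScorerLike; it only provides the method."""

    def __init__(self, p: float) -> None:
        self._p = p

    def predict_proba(self, X: Sequence[str]) -> list[float]:
        return [self._p] * len(X)


class NotAScorer:
    """Lacks predict_proba, so it fails the structural check."""


scorer = ConstantScorer(0.7)
is_scorer = isinstance(scorer, ScorerLike)            # method is present
is_not_scorer = isinstance(NotAScorer(), ScorerLike)  # method is missing
scores = scorer.predict_proba(["a", "b", "c"])
print(is_scorer, is_not_scorer, scores)
```

Note that `isinstance` against a `runtime_checkable` Protocol only checks that the method exists, not its signature or return type, so a static type checker remains the stronger guarantee.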

## Run `evaluate`

`evaluate` is the pure (no IO) orchestrator: it maps scorers × slices to a
`RunResult` and computes bootstrap CIs for each (slice, scorer) cell:

```{code-cell}
result = evaluate(
    scorers={"baseline": baseline, "challenger": challenger},
    slices=[val_slice, ood_slice],
    run_id="example_run",
    n_resamples=50,  # smaller for the example — production: 1000+
    seed=42,
)
print(f"run_id: {result.run_id}")
print(f"slices in result: {list(result.by_slice.keys())}")
print(f"scorers per slice: {list(result.by_slice['validation']['by_scorer'].keys())}")
```
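
To make the bootstrap step concrete, here is a minimal percentile-bootstrap sketch for a single (slice, scorer) cell. It assumes accuracy at a 0.5 threshold as the metric and a plain percentile interval; the helper names `accuracy` and `bootstrap_ci` are hypothetical, and the metrics and resampling scheme that `evaluate` actually uses are not shown in this example.

```python
import random


def accuracy(y_true: list[int], scores: list[float], threshold: float = 0.5) -> float:
    """Fraction of rows whose thresholded score matches the label."""
    hits = sum(int(s >= threshold) == y for y, s in zip(y_true, scores))
    return hits / len(y_true)


def bootstrap_ci(y_true, scores, n_resamples=200, alpha=0.05, seed=42):
    """Percentile bootstrap: resample rows with replacement, recompute the metric."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(accuracy([y_true[i] for i in idx], [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * (n_resamples - 1))]
    hi = stats[int((1 - alpha / 2) * (n_resamples - 1))]
    return accuracy(y_true, scores), lo, hi


# Toy data: 8 rows, 6 of them classified correctly at threshold 0.5.
y = [0, 0, 0, 0, 1, 1, 1, 1]
s = [0.1, 0.3, 0.6, 0.2, 0.9, 0.4, 0.8, 0.7]
point, lo, hi = bootstrap_ci(y, s)
print(f"accuracy={point:.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```

With few resamples the interval endpoints are themselves noisy, which is why the example above keeps `n_resamples` small only for speed.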

## Persist + validate the JSON contract

`write_run_result` writes both a compact and a full JSON. The compact
one strips per-row prediction arrays so it's small enough to git-commit;
the full one keeps everything for offline analysis:

```{code-cell}
with TemporaryDirectory() as tmpdir:
    run_dir = Path(tmpdir) / "example_run"
    compact_path, full_path = write_run_result(result, run_dir)
    assert compact_path.exists()
    assert full_path.exists()
    payload = json.loads(compact_path.read_text())

    # The JSON contract: must validate against schemas/results.v1.json
    validate_results(payload)

    # Schema-required fields
    assert payload["schema_version"] == "v1"
    assert payload["run_id"] == "example_run"
    assert "validation" in payload["by_slice"]
    assert "baseline" in payload["by_slice"]["validation"]["by_scorer"]
    print("compact JSON validated against results.v1.json ✓")
```
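
The compact/full split can be pictured as a recursive filter over the payload. The sketch below is only an illustration: the key names `predictions` and `n` and the helper `strip_per_row_arrays` are hypothetical, and the fields `write_run_result` actually drops are defined by the toolkit, not by this snippet.

```python
import json
from typing import Any

# Hypothetical key names; the toolkit's real per-row fields may differ.
PER_ROW_KEYS = frozenset({"predictions"})


def strip_per_row_arrays(node: Any) -> Any:
    """Return a copy of `node` with every per-row array key removed."""
    if isinstance(node, dict):
        return {
            k: strip_per_row_arrays(v)
            for k, v in node.items()
            if k not in PER_ROW_KEYS
        }
    if isinstance(node, list):
        return [strip_per_row_arrays(v) for v in node]
    return node


# Toy "full" payload shaped like the by_slice / by_scorer nesting above.
full = {
    "schema_version": "v1",
    "run_id": "example_run",
    "by_slice": {
        "validation": {
            "by_scorer": {
                "baseline": {"n": 3, "predictions": [0.2, 0.9, 0.6]},
            }
        }
    },
}
compact = strip_per_row_arrays(full)
print(json.dumps(compact, indent=2))
```

Because the filter builds a new structure instead of mutating its input, the full payload stays available for offline analysis.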

## Pre-1.0 design note

`evaluate(...)` is *pure*: no filesystem touched. `write_run_result(...)`
is the only IO sink. This split lets you test the harness logic
deterministically (no `tmp_path` fixture needed for in-process verification)
and keeps the durable on-disk artifact a separate, schema-validated layer.
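
The same pure-core/IO-shell split can be sketched with toy stand-ins. `summarize` and `write_summary` below are hypothetical names, not toolkit functions; they only illustrate why the pure half can be tested without touching disk.

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory


def summarize(scores: list[float]) -> dict:
    """Pure: the output depends only on the input, with no filesystem access."""
    return {"n": len(scores), "mean": sum(scores) / len(scores)}


def write_summary(summary: dict, run_dir: Path) -> Path:
    """IO shell: the only function here that writes to disk."""
    run_dir.mkdir(parents=True, exist_ok=True)
    path = run_dir / "summary.json"
    path.write_text(json.dumps(summary, sort_keys=True))
    return path


# The pure half is checked in-process, with no temporary directory.
summary = summarize([0.25, 0.75])

# The IO half is exercised separately, only where persistence matters.
with TemporaryDirectory() as tmpdir:
    path = write_summary(summary, Path(tmpdir) / "run")
    round_tripped = json.loads(path.read_text())

print(summary, round_tripped)
```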

## See also

- [`harness.py` reference](../api/harness.md) — `evaluate`,
  `evaluate_folded`, `EvalSlice`, `RunResult`, `Scorer` Protocol.
- [`artifacts.py` reference](../api/artifacts.md) — `validate_results`,
  `validate_manifest`, JSON-schema dispatcher.
- [Calibration example](calibration.md) — apply `fit_platt_calibrator`
  to the scorer outputs before evaluating.
- [Leakage detection example](leakage_detection.md) — gate the harness
  with `LeakageCheck`s.