Worked example: slice-aware `evaluate` harness

What this shows: how to run two scorers across two slices with evaluate(...), persist the result with write_run_result(...), load the JSON back, and verify that it conforms to the results.v1.json schema.

Runtime: ~2 s. Requires pandas for EvalSlice’s DataFrame wrapper; install it with pip install 'eval-toolkit[dataframe]'.

Setup

import json
from pathlib import Path
from tempfile import TemporaryDirectory

import numpy as np
import pandas as pd

from eval_toolkit import (
    EvalSlice, evaluate, write_run_result, set_global_seeds,
)
from eval_toolkit.artifacts import validate_results

set_global_seeds(42)

Build two slices

We build a “validation” slice (in-distribution) and an “ood” slice (out-of-distribution, intended to carry less signal). The harness scores each slice independently:

rng = np.random.default_rng(42)

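def _make_slice(name: str, n: int, signal: float) -> EvalSlice:
    """Synthetic slice with balanced, shuffled binary labels.

    Each text ends with its label so the stub scorers can recover it from
    the suffix. `signal` is kept in the signature but is not used when
    building the slice; score separation comes from the scorer stubs.
    """
    y = np.concatenate([np.zeros(n // 2), np.ones(n - n // 2)]).astype(int)
    rng.shuffle(y)
    df = pd.DataFrame({
        # Suffix each text with its label, e.g. "validation_row_3_1".
        "text": [f"{name}_row_{i}_{label}" for i, label in enumerate(y)],
        "label": y,
    })
    return EvalSlice(name=name, df=df)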

val_slice = _make_slice("validation", n=100, signal=0.4)
ood_slice = _make_slice("ood", n=80, signal=0.2)
print(f"slices: {val_slice.name} (n={len(val_slice.df)}), {ood_slice.name} (n={len(ood_slice.df)})")

slices: validation (n=100), ood (n=80)

Define two `Scorer` implementations

Any object with predict_proba(X) -> np.ndarray satisfies the Scorer Protocol. Toolkit consumers wire their real models here (sklearn estimators, PyTorch transformers, LLM judges); for this example we use two minimal stubs:

class _DiscriminativeStub:
    """Returns scores correlated with label + Gaussian noise."""
    def __init__(self, signal: float, noise: float, seed: int) -> None:
        self._signal = signal
        self._noise = noise
        self._rng = np.random.default_rng(seed)

    def predict_proba(self, X: list[str]) -> np.ndarray:
        n = len(X)
        # Recover label from the synthetic text suffix
        labels = np.array([int(x.split("_")[-1]) % 2 for x in X])
        return np.clip(
            0.5 + self._signal * (labels - 0.5) + self._rng.normal(0, self._noise, size=n),
            0.0, 1.0,
        )

baseline = _DiscriminativeStub(signal=0.3, noise=0.2, seed=42)
challenger = _DiscriminativeStub(signal=0.4, noise=0.15, seed=43)
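To make the structural-typing point concrete, here is a minimal sketch of how a Protocol with a single predict_proba method accepts any conforming object without inheritance. ScorerLike and ConstantScorer are hypothetical names used only for illustration, and the sketch returns a plain list of floats instead of np.ndarray to stay dependency-free. The toolkit's actual Scorer Protocol may differ, for example in whether it is runtime-checkable.

```python
from typing import Protocol, Sequence, runtime_checkable


@runtime_checkable
class ScorerLike(Protocol):
    """Hypothetical stand-in for a Scorer-style Protocol (not the toolkit's definition)."""

    def predict_proba(self, X: Sequence[str]) -> Sequence[float]: ...


class ConstantScorer:
    """Hypothetical scorer: no inheritance from ScorerLike, only a matching method."""

    def __init__(self, p: float) -> None:
        self._p = p

    def predict_proba(self, X: Sequence[str]) -> Sequence[float]:
        # One score per input row, all equal to the constant p.
        return [self._p] * len(X)


scorer = ConstantScorer(0.7)
scores = scorer.predict_proba(["a", "b", "c"])

# isinstance on a runtime_checkable Protocol only checks that the method
# exists; it does not check the signature or the return type.
conforms = isinstance(scorer, ScorerLike)
plain_object_conforms = isinstance(object(), ScorerLike)
```

Because the runtime check is this shallow, a static type checker is the more reliable way to confirm that a real model wrapper matches the expected signature.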

Run `evaluate`

evaluate is the pure (no IO) orchestrator: it maps scorers × slices to a RunResult and computes bootstrap CIs for each (slice, scorer) cell:

result = evaluate(
    scorers={"baseline": baseline, "challenger": challenger},
    slices=[val_slice, ood_slice],
    run_id="example_run",
    n_resamples=50,  # smaller for the example — production: 1000+
    seed=42,
)
print(f"run_id: {result.run_id}")
print(f"slices in result: {list(result.by_slice.keys())}")
print(f"scorers per slice: {list(result.by_slice['validation']['by_scorer'].keys())}")

run_id: example_run
slices in result: ['validation', 'ood']
scorers per slice: ['baseline', 'challenger']
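For intuition about what a bootstrap CI on a single (slice, scorer) cell involves, the sketch below resamples rows with replacement and takes percentiles of the recomputed metric. It is not the toolkit's implementation: the function name bootstrap_ci, the choice of accuracy at a 0.5 threshold, and the percentile method are all assumptions made for illustration.

```python
import random
from statistics import mean


def bootstrap_ci(labels, scores, n_resamples=200, alpha=0.05, seed=42):
    """Percentile bootstrap CI for accuracy at a 0.5 threshold (illustrative only)."""
    rng = random.Random(seed)
    n = len(labels)
    # Per-row correctness: 1 if the thresholded score matches the label.
    correct = [int((s >= 0.5) == bool(y)) for y, s in zip(labels, scores)]
    stats = []
    for _ in range(n_resamples):
        # Resample rows with replacement and recompute the metric.
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        stats.append(mean(sample))
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return mean(correct), lo, hi


# Toy cell: 40 rows; the score is wrong on every fourth row, so accuracy is 0.75.
labels = [i % 2 for i in range(40)]
scores = [0.8 if (y == 1) != (i % 4 == 0) else 0.2 for i, y in enumerate(labels)]
point, lo, hi = bootstrap_ci(labels, scores)
```

A small n_resamples keeps an example fast but gives coarse interval endpoints, which is why the call above notes a larger value for production runs.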

Persist + validate the JSON contract

write_run_result writes both a compact and a full JSON. The compact one strips per-row prediction arrays so it’s small enough to git-commit; the full one keeps everything for offline analysis:

with TemporaryDirectory() as tmpdir:
    run_dir = Path(tmpdir) / "example_run"
    compact_path, full_path = write_run_result(result, run_dir)
    assert compact_path.exists()
    assert full_path.exists()
    payload = json.loads(compact_path.read_text())

    # The JSON contract: must validate against schemas/results.v1.json
    validate_results(payload)

    # Schema-required fields
    assert payload["schema_version"] == "v1"
    assert payload["run_id"] == "example_run"
    assert "validation" in payload["by_slice"]
    assert "baseline" in payload["by_slice"]["validation"]["by_scorer"]
    print("compact JSON validated against results.v1.json ✓")

compact JSON validated against results.v1.json ✓
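The compact/full split can be pictured as a recursive filter over the payload. The sketch below is only an illustration of that idea: strip_per_row is a hypothetical helper, and the "metrics" and "predictions" keys and their values are made up for the toy payload rather than taken from results.v1.json.

```python
import json

# Hypothetical full payload; "metrics" and "predictions" are illustrative keys,
# not the real schema.
full = {
    "run_id": "example_run",
    "by_slice": {
        "validation": {
            "by_scorer": {
                "baseline": {
                    "metrics": {"accuracy": 0.75},
                    "predictions": [0.1, 0.9, 0.4],
                },
            }
        }
    },
}


def strip_per_row(payload, drop_key="predictions"):
    """Return a new structure with every `drop_key` entry removed, at any depth."""
    if isinstance(payload, dict):
        return {k: strip_per_row(v, drop_key) for k, v in payload.items() if k != drop_key}
    if isinstance(payload, list):
        return [strip_per_row(v, drop_key) for v in payload]
    return payload


compact = strip_per_row(full)
compact_size = len(json.dumps(compact))
full_size = len(json.dumps(full))
```

The input is left untouched, so the same in-memory result can feed both the compact and the full file.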

Pre-1.0 design note

evaluate(...) is pure: no filesystem touched. write_run_result(...) is the only IO sink. This split lets you test the harness logic deterministically (no tmp_path fixture needed for in-process verification) and keeps the durable on-disk artifact a separate, schema-validated layer.
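The same pure-core / IO-sink split can be sketched in a few lines of standard-library Python. compute_summary and write_summary are hypothetical stand-ins for evaluate and write_run_result, chosen only to show why the pure step can be tested in memory while only the sink needs a directory.

```python
import json
import random
import tempfile
from pathlib import Path


def compute_summary(values: list[float], seed: int) -> dict:
    """Pure step: the same inputs and seed give the same dict; no filesystem access."""
    rng = random.Random(seed)
    sample = [rng.choice(values) for _ in values]
    return {"seed": seed, "resampled_mean": sum(sample) / len(sample)}


def write_summary(summary: dict, out_dir: Path) -> Path:
    """IO sink: the only function that touches disk."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / "summary.json"
    path.write_text(json.dumps(summary))
    return path


# The pure step can be checked in memory, with no temporary directory.
first = compute_summary([0.2, 0.4, 0.9], seed=42)
second = compute_summary([0.2, 0.4, 0.9], seed=42)
deterministic = first == second

# Only the sink needs a directory; reading the file back returns the same dict.
with tempfile.TemporaryDirectory() as tmp:
    path = write_summary(first, Path(tmp) / "run")
    round_trip = json.loads(path.read_text()) == first
```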

See also

harness.py reference — evaluate, evaluate_folded, EvalSlice, RunResult, Scorer Protocol.
artifacts.py reference — validate_results, validate_manifest, JSON-schema dispatcher.
Calibration example — apply fit_platt_calibrator to the scorer outputs before evaluating.
Leakage detection example — gate the harness with LeakageChecks.