# Getting Started

A linear walkthrough from "I have a trained model" to "I have a
`results.json` a stakeholder can read." Aimed at Python-fluent readers
new to eval-toolkit; no prior sklearn-eval experience assumed.

If you're already comfortable with sklearn-style evaluation, you can
skim the conceptual sections (marked **What is...**) and read the code
blocks directly.

## Table of contents

1. [What is an eval, and what does this toolkit do?](#what-is-an-eval)
2. [Install](#install)
3. [The Scorer concept](#getting-started-scorer)
4. [The EvalSlice concept](#eval-slice)
5. [Run `evaluate()` and read the output](#evaluate)
6. [Persist results](#persist)
7. [(Optional) Validate the JSON](#validate)
8. [(Optional) Add a claim](#add-a-claim)
9. [(Optional) Render a plot](#plot)
10. [Common errors](#common-errors)
11. [Where to go next](#where-to-go-next)

(what-is-an-eval)=
## What is an eval, and what does this toolkit do?
**An evaluation** is the process of turning a model's predictions into
**calibrated metrics with uncertainty**. The numbers (PR-AUC, ROC-AUC,
precision-at-recall-X) are the surface. The *calibration* (does the
score 0.8 actually mean 80% chance of positive?) and the *uncertainty*
(is the +5 pp PR-AUC lift over baseline likely real or noise?) are the
substance.
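
To make the calibration question concrete, the sketch below bins scores and compares the mean score in each bin with the observed positive rate. It uses only NumPy on synthetic data. `reliability_table` is a hypothetical helper written for this guide, not part of eval-toolkit, and the toolkit's own calibration metrics may bin differently.

```python
import numpy as np


def reliability_table(scores, labels, n_bins=5):
    """Return (mean_score, positive_rate, count) per non-empty equal-width bin."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Digitizing against the interior edges yields bin ids in 0..n_bins-1.
    bin_ids = np.digitize(scores, edges[1:-1])
    rows = []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            rows.append(
                (float(scores[mask].mean()), float(labels[mask].mean()), int(mask.sum()))
            )
    return rows


rng = np.random.default_rng(0)
scores = rng.uniform(0.0, 1.0, size=5000)
# Labels are drawn so that P(label = 1) equals the score: calibrated by construction.
labels = (rng.uniform(0.0, 1.0, size=5000) < scores).astype(int)

table = reliability_table(scores, labels)
for mean_score, positive_rate, count in table:
    print(f"mean score {mean_score:.2f}  observed rate {positive_rate:.2f}  n={count}")
```

For a calibrated scorer the two columns track each other. A scorer that ranks well but is miscalibrated would show the same ordering with a systematic gap between them.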

This toolkit sits between two things you already have:

- **A model that produces probability scores.** Could be sklearn,
  PyTorch, an API call to a hosted model, a regex — anything that
  takes inputs and returns `P(positive)`.
- **Labeled data to evaluate it on.** Rows with a binary label and a
  text (or feature) column.

What you get back:

- Headline metrics (`pr_auc`, `roc_auc`, `brier_score`, ...)
- Bootstrap confidence intervals on those metrics
- Per-slice breakdowns (dev vs test, by source, by strata)
- Paired-difference CIs when comparing two models on the same rows
- A reproducible manifest (`git_sha`, seed, GPU info, dataset hashes)
- A `results.json` and `manifest.json` that downstream consumers can
  parse against a versioned JSON Schema

The toolkit does **not** ship report templates, dashboard renderers,
or claim copy — those are domain-specific and belong in your
consumer code.

(install)=
## Install
```bash
pip install eval-toolkit
```

Or with optional extras:

```bash
pip install "eval-toolkit[dataframe,plotting,validation]"
```

Common extras:

- `dataframe` — `pandas`. Required if you want to pass `pd.DataFrame`
  to `EvalSlice` (the easy path; this guide assumes it).
- `plotting` — `matplotlib` + `pillow`. Required for the `plot_*`
  helpers.
- `validation` — `jsonschema`. Required for `validate_payload(...)`.
- `property` — `hypothesis`. Only if you write property tests against
  the toolkit itself.
- `all` — everything optional, the kitchen-sink install.

This guide uses `dataframe` and `validation`. Plotting is optional;
see [section 9](#plot).
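
If you're unsure which extras are present in an environment, a quick check along these lines can help. This is a sketch using only the standard library. The extra-to-module mapping is copied from the list above, and `missing_extras` is a hypothetical helper, not a toolkit function.

```python
from importlib.util import find_spec

# Extra name -> top-level modules the list above says it pulls in.
EXTRA_MODULES = {
    "dataframe": ["pandas"],
    "plotting": ["matplotlib", "PIL"],  # pillow is imported as PIL
    "validation": ["jsonschema"],
    "property": ["hypothesis"],
}


def missing_extras(extras=EXTRA_MODULES):
    """Return {extra: [modules that cannot be imported]} for extras with gaps."""
    missing = {}
    for extra, modules in extras.items():
        absent = [m for m in modules if find_spec(m) is None]
        if absent:
            missing[extra] = absent
    return missing


for extra, modules in missing_extras().items():
    print(f'missing for "{extra}": {", ".join(modules)}')
```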

(getting-started-scorer)=
## The Scorer concept
**A `Scorer`** is anything that exposes a `predict_proba(X)` method
returning one probability per input row, where `probability` ∈ [0, 1]
and represents `P(positive class)`.

That's the entire contract. It's deliberately Protocol-based: you
don't subclass anything, you just implement the method. Your model
class is probably close already: `sklearn` classifiers expose
`predict_proba` (returning one column per class, so you select the
positive column), and `transformers` pipelines need only a thin
wrapper.

### Example: a minimal Scorer

```python
import numpy as np


class LengthScorer:
    """Scores longer texts higher. Useful only as a demo Scorer."""

    def predict_proba(self, X: list[str]) -> np.ndarray:
        # Map length to a [0, 1] score via a saturating function.
        lengths = np.array([len(x) for x in X], dtype=float)
        return lengths / (lengths + 10.0)


scorer = LengthScorer()
probs = scorer.predict_proba(["hi", "hello world"])
assert probs.shape == (2,)
assert (0.0 <= probs).all() and (probs <= 1.0).all()
```

That's a fully valid `Scorer`. No registration, no base class.
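
To illustrate what "Protocol-based" means in practice, here is a sketch built on `typing.Protocol`. `ScorerLike` is a hypothetical stand-in defined only for this example; the toolkit's actual protocol class may have a different name and annotations.

```python
from typing import Protocol, Sequence, runtime_checkable


@runtime_checkable
class ScorerLike(Protocol):
    """Hypothetical stand-in for the toolkit's Scorer protocol."""

    def predict_proba(self, X: Sequence[str]) -> Sequence[float]: ...


class ConstantScorer:
    # No base class: having the method is enough.
    def predict_proba(self, X):
        return [0.5 for _ in X]


class NotAScorer:
    def predict(self, X):
        return [0 for _ in X]


assert isinstance(ConstantScorer(), ScorerLike)
assert not isinstance(NotAScorer(), ScorerLike)
```

A runtime `isinstance` check against a protocol only confirms that the method exists. It does not verify the signature or that the returned values lie in [0, 1].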

If you have an sklearn pipeline:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# sklearn pipelines already implement predict_proba(X) → (n, n_classes).
# Wrap to return only the positive-class column.

class SklearnBinaryScorer:
    def __init__(self, pipe):
        self.pipe = pipe

    def predict_proba(self, X) -> np.ndarray:
        return self.pipe.predict_proba(X)[:, 1]


pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=200)),
])
# (you'd fit pipe on training data here)
```

If your model is async / behind an API: cache the responses upfront,
then have `predict_proba` look up the cached scores. The toolkit
doesn't care.
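
The caching approach might look like the following sketch. `CachedScorer` and the example cache contents are hypothetical. In a real harness the mapping would be filled by your own API client before evaluation starts.

```python
import numpy as np


class CachedScorer:
    """Serves scores from a precomputed {input: P(positive)} mapping."""

    def __init__(self, cache):
        self.cache = dict(cache)

    def predict_proba(self, X) -> np.ndarray:
        missing = [x for x in X if x not in self.cache]
        if missing:
            # Fail loudly instead of silently scoring uncached rows.
            raise KeyError(
                f"{len(missing)} inputs have no cached score, e.g. {missing[0]!r}"
            )
        return np.array([self.cache[x] for x in X], dtype=float)


# Made-up scores standing in for responses fetched ahead of time.
cache = {"hello": 0.12, "free money now": 0.93}
scorer = CachedScorer(cache)
probs = scorer.predict_proba(["free money now", "hello"])
```

Raising on a cache miss is a design choice: a default score for unseen inputs would quietly distort every metric computed on the slice.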

(eval-slice)=
## The EvalSlice concept
**An `EvalSlice`** is *the unit of evaluation*: a named, labeled
subset of data that you want metrics computed on. You typically have
several:

- `dev` and `test` (the standard split)
- `by_source` (predictions on different data sources)
- `by_strata` (predictions on different label-balanced strata)
- OOD slices, regression slices, stress-test slices, etc.

Each slice is constructed from a pandas DataFrame with at minimum a
`text` and `label` column. `label` must be `{0, 1}`.

### Example: building two slices

```python
import numpy as np
import pandas as pd
from eval_toolkit import EvalSlice

# Synthetic dev set: 100 rows, roughly balanced classes.
rng = np.random.default_rng(42)
n = 100
labels = rng.integers(0, 2, size=n)
# Texts whose length correlates with the label.
texts = [
    "x" * (3 + int(label) * 8 + int(rng.integers(0, 4)))
    for label in labels
]
dev_df = pd.DataFrame({"text": texts, "label": labels})

dev_slice = EvalSlice(name="dev", df=dev_df)
assert dev_slice.name == "dev"
assert len(dev_slice.df) == 100
```

The constructor validates the shape: `text` and `label` columns must
exist, labels must be in `{0, 1}` (other label encodings raise a
`ValueError`), and the DataFrame must be non-empty.
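
If you want to catch these problems before constructing a slice, the same rules can be checked in plain pandas. This is a sketch: `check_slice_frame` is a hypothetical helper, and the exception types and messages are chosen for illustration and are not guaranteed to match what the `EvalSlice` constructor raises.

```python
import pandas as pd


def check_slice_frame(df: pd.DataFrame) -> None:
    """Hypothetical pre-check mirroring the rules described above."""
    missing = {"text", "label"} - set(df.columns)
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    if df.empty:
        raise ValueError("DataFrame must be non-empty")
    bad = set(df["label"].unique()) - {0, 1}
    if bad:
        raise ValueError(f"labels must be in {{0, 1}}, found {sorted(bad, key=str)}")


# A well-formed frame passes silently.
check_slice_frame(pd.DataFrame({"text": ["a", "b"], "label": [0, 1]}))
```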

If you have multiple sources to evaluate per-source:

```python
import pandas as pd
from eval_toolkit import EvalSlice

# Tag each row with its source, then build one slice per source.
df = pd.DataFrame({
    "text": ["a", "b", "c", "d", "e", "f"],
    "label": [0, 1, 0, 1, 0, 1],
    "source": ["A", "A", "B", "B", "C", "C"],
})

slices = [
    EvalSlice(name=f"source_{src}", df=sub.reset_index(drop=True))
    for src, sub in df.groupby("source")
]
assert len(slices) == 3
```

(evaluate)=
## Run `evaluate()` and read the output
`evaluate(...)` is the orchestrator. Given a mapping of scorers and a
list of slices, it computes the full headline-metric battery per
(slice, scorer) pair, runs bootstrap CIs, and returns a `RunResult`.

```python
import numpy as np
import pandas as pd
from eval_toolkit import EvalSlice, evaluate


class LengthScorer:
    def predict_proba(self, X):
        lengths = np.array([len(x) for x in X], dtype=float)
        return lengths / (lengths + 10.0)


rng = np.random.default_rng(0)
n = 100
labels = rng.integers(0, 2, size=n)
texts = ["x" * (3 + int(label) * 8) for label in labels]
df = pd.DataFrame({"text": texts, "label": labels})
dev_slice = EvalSlice(name="dev", df=df)

result = evaluate(
    {"length": LengthScorer()},
    [dev_slice],
    run_id="demo-run",
    n_resamples=50,  # small for the doctest; use 1000+ in real runs
    seed=42,
)

assert result.run_id == "demo-run"
assert "dev" in result.by_slice
```

### Reading the output

`result.by_slice` is a nested dict:

```
by_slice
├── "dev"
│   ├── "n"          : 100
│   ├── "n_positive" : ~50 (depends on RNG)
│   ├── "by_scorer"
│   │   └── "length"
│   │       ├── "pr_auc"           : float in [0, 1]
│   │       ├── "roc_auc"          : float in [0, 1]
│   │       ├── "pr_auc_ci"        : BootstrapCI dict
│   │       │   ├── "point_estimate" : float
│   │       │   ├── "ci_95"          : [low, high]   (or "skipped" if n<30)
│   │       │   ├── "confidence"     : 0.95
│   │       │   ├── "n_resamples"    : 50
│   │       │   └── "method"         : "BCa" | "percentile"
│   │       ├── "ece"              : float (expected calibration error)
│   │       └── ...                 (other metrics, plus operating_points)
│   └── "paired_diffs" : {}  (empty unless paired_diffs= explicitly set)
```

Access a metric:

```python
import numpy as np
import pandas as pd
from eval_toolkit import EvalSlice, evaluate


class _Scorer:
    def predict_proba(self, X):
        return np.array([len(x) / (len(x) + 10) for x in X])


# Bootstrap CIs are skipped when n < 30, so use a slice with at least 30 rows.
rng = np.random.default_rng(0)
n = 40
labels = rng.integers(0, 2, size=n)
texts = ["x" * (3 + int(label) * 8) for label in labels]
df = pd.DataFrame({"text": texts, "label": labels})
result = evaluate({"m": _Scorer()}, [EvalSlice(name="dev", df=df)], run_id="r", n_resamples=20)

pr_auc = result.by_slice["dev"]["by_scorer"]["m"]["pr_auc"]
ci = result.by_slice["dev"]["by_scorer"]["m"]["pr_auc_ci"]
assert 0.0 <= pr_auc <= 1.0
# ci is a BootstrapCI dict with point_estimate + ci_95 [low, high]
assert "ci_95" in ci or ci.get("status") == "skipped"
```
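
When you have several slices and scorers, it is often handy to flatten the nested dict into rows. The sketch below works on a hand-written dict shaped like the tree above. The numbers are made up, and `metric_rows` is a hypothetical helper rather than a toolkit function.

```python
# Hand-written stand-in shaped like the tree above; the numbers are made up.
by_slice = {
    "dev": {
        "n": 100,
        "by_scorer": {
            "length": {"pr_auc": 0.91, "roc_auc": 0.88},
            "tfidf": {"pr_auc": 0.95, "roc_auc": 0.93},
        },
    },
    "test": {
        "n": 80,
        "by_scorer": {
            "length": {"pr_auc": 0.87, "roc_auc": 0.85},
        },
    },
}


def metric_rows(by_slice, metric):
    """Flatten {slice: {"by_scorer": {scorer: {...}}}} into (slice, scorer, value) rows."""
    rows = []
    for slice_name, slice_payload in by_slice.items():
        for scorer_name, metrics in slice_payload.get("by_scorer", {}).items():
            if metric in metrics:
                rows.append((slice_name, scorer_name, metrics[metric]))
    return rows


for slice_name, scorer_name, value in metric_rows(by_slice, "pr_auc"):
    print(f"{slice_name:<6}{scorer_name:<8}{value:.3f}")
```

On a real `RunResult`, a metric may hold a skipped payload instead of a float, so check the value's type before formatting it as a number.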

### Comparing two scorers

When you want a paired-difference CI between two scorers on the same
rows, pass `paired_diffs=[(baseline, candidate)]` to `evaluate(...)`:

```python
import numpy as np
import pandas as pd
from eval_toolkit import EvalSlice, evaluate


class A:
    def predict_proba(self, X):
        return np.array([0.3 + 0.4 * (i % 2) for i in range(len(X))])


class B:
    def predict_proba(self, X):
        return np.array([0.4 + 0.5 * (i % 2) for i in range(len(X))])


df = pd.DataFrame({"text": ["x"] * 40, "label": [0, 1] * 20})
result = evaluate(
    {"a": A(), "b": B()},
    [EvalSlice(name="dev", df=df)],
    run_id="r",
    n_resamples=20,
    paired_diffs=[("a", "b")],  # explicit baseline → candidate pair
)
diffs = result.by_slice["dev"]["paired_diffs"]
assert ("a", "b") in diffs or "a__minus__b" in diffs or len(diffs) >= 1
```
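
To show what "paired" buys you, here is a minimal sketch of a paired percentile bootstrap in plain NumPy. It is not the toolkit's implementation. It uses the Brier score because it is a one-liner, assumes a candidate-minus-baseline sign convention, and all function names are hypothetical.

```python
import numpy as np


def brier(scores, labels):
    """Mean squared error between scores and 0/1 labels (lower is better)."""
    return float(np.mean((scores - labels) ** 2))


def paired_bootstrap_diff(scores_a, scores_b, labels, n_resamples=1000, seed=0):
    """95% percentile CI for brier(b) - brier(a), resampling rows jointly."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    diffs = np.empty(n_resamples)
    for i in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # the same resampled rows for both scorers
        diffs[i] = brier(scores_b[idx], labels[idx]) - brier(scores_a[idx], labels[idx])
    low, high = np.percentile(diffs, [2.5, 97.5])
    return float(low), float(high)


rng = np.random.default_rng(1)
labels = rng.integers(0, 2, size=200)
# Synthetic scorers: "a" separates the classes weakly, "b" more strongly.
scores_a = np.clip(0.5 + 0.2 * (labels - 0.5) + rng.normal(0, 0.1, 200), 0, 1)
scores_b = np.clip(0.5 + 0.6 * (labels - 0.5) + rng.normal(0, 0.1, 200), 0, 1)

low, high = paired_bootstrap_diff(scores_a, scores_b, labels)
print(f"Brier(b) - Brier(a), 95% CI: [{low:.3f}, {high:.3f}]")
```

Because both scorers are evaluated on the same resampled rows, row-level noise that affects them equally cancels in the difference. That is why a paired interval is usually tighter than comparing two independent CIs.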

(persist)=
## Persist results
`RunResult.to_dict()` produces a strict-JSON-safe payload:

```python
import json
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

from eval_toolkit import EvalSlice, evaluate
from eval_toolkit.artifacts import write_json_strict


class _S:
    def predict_proba(self, X):
        return np.linspace(0.1, 0.9, len(X))


df = pd.DataFrame({"text": [f"row_{i}" for i in range(10)], "label": [0, 1] * 5})
result = evaluate({"m": _S()}, [EvalSlice(name="dev", df=df)], run_id="demo", n_resamples=10)

out_path = Path(tempfile.gettempdir()) / "demo_results.json"
write_json_strict(result.to_dict(), out_path)

# What the on-disk JSON looks like:
data = json.loads(out_path.read_text())
assert data["run_id"] == "demo"
assert "schema_version" in data
```

`write_json_strict` uses `allow_nan=False` and runs the payload
through `sanitize_for_json` first — NaN / Inf becomes a structured
`skipped_metric(...)` payload rather than producing invalid JSON.
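
The standard-library behaviour that motivates this is easy to reproduce. In the sketch below, `replace_non_finite` and the placeholder dict it emits are hypothetical. The real `sanitize_for_json` and `skipped_metric(...)` payload shape are defined by the toolkit and may differ.

```python
import json
import math


def replace_non_finite(value):
    """Recursively swap NaN / Inf floats for a placeholder dict (hypothetical shape)."""
    if isinstance(value, float) and not math.isfinite(value):
        return {"status": "skipped", "reason": "non-finite value"}
    if isinstance(value, dict):
        return {k: replace_non_finite(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [replace_non_finite(v) for v in value]
    return value


payload = {"pr_auc": float("nan"), "roc_auc": 0.91}

# The default encoder emits the bare token NaN, which is not valid JSON.
lenient = json.dumps(payload)

# allow_nan=False refuses instead of writing something strict parsers reject.
try:
    json.dumps(payload, allow_nan=False)
    strict_failed = False
except ValueError:
    strict_failed = True

clean = json.dumps(replace_non_finite(payload), allow_nan=False)
print(clean)
```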

(validate)=
## (Optional) Validate the JSON
Validate against the bundled JSON Schema to catch shape regressions
between your harness and consumer parsers:

```python
# Requires: pip install "eval-toolkit[validation]"
import numpy as np
import pandas as pd

from eval_toolkit import EvalSlice, evaluate
from eval_toolkit.artifacts import validate_payload


class _S:
    def predict_proba(self, X):
        return np.linspace(0.1, 0.9, len(X))


df = pd.DataFrame({"text": [f"r{i}" for i in range(10)], "label": [0, 1] * 5})
result = evaluate({"m": _S()}, [EvalSlice(name="dev", df=df)], run_id="demo", n_resamples=10)

# This is a no-op on success; raises jsonschema.ValidationError on a bad shape.
validate_payload(result.to_dict(), schema_name="results.v1.json")
```

You can also validate from the CLI without writing Python:

```bash
eval-toolkit validate run_dir/results.json results.v1
```

(See [docs/schemas.md](schemas.md) for the field-by-field reference.)

(add-a-claim)=
## (Optional) Add a claim
**A claim** is a release-time go/no-go assertion: "PR-AUC is supported
on the dev slice with at least 100 rows, including at least 20
positives and 20 negatives, and the metric value is at least 0.7."
Claims are *not* exploratory metrics; they're preregistered
preconditions that the renderer reads to decide whether to print
"we claim X" or "we cannot claim X."

```python
import numpy as np
import pandas as pd

from eval_toolkit import EvalSlice, evaluate
from eval_toolkit.claims import (
    ClaimSpec,
    evaluate_claims,
    metric_threshold_gate,
    minimum_slice_size_gate,
    required_metric_gate,
)
from eval_toolkit.harness import with_claim_report


class _S:
    def predict_proba(self, X):
        # Score = 0.9 if label-marker, else 0.1
        return np.array([0.9 if "P" in x else 0.1 for x in X])


df = pd.DataFrame({
    "text": ["P_a", "P_b", "N_a", "N_b"] * 50,
    "label": [1, 1, 0, 0] * 50,
})
result = evaluate({"m": _S()}, [EvalSlice(name="dev", df=df)], run_id="demo", n_resamples=20)

claim = ClaimSpec(
    name="dev_pr_auc_supported",
    gates=(
        required_metric_gate("dev", "m", "pr_auc"),
        minimum_slice_size_gate("dev", min_n=100, min_positive=20, min_negative=20),
        metric_threshold_gate("dev", "m", "pr_auc", op=">=", threshold=0.7),
    ),
)

report = evaluate_claims(result, [claim])
assert report.has_failures() is False

# Attach the claim report to the RunResult for the renderer to read:
result_with_claim = with_claim_report(result, report)
assert result_with_claim.claim_report is not None
```

Each of the three gate calls above (`required_metric_gate`,
`minimum_slice_size_gate`, `metric_threshold_gate`) is a factory
that returns an `EvidenceGate` instance — a frozen dataclass bundling
a callable check, a name, and a severity. Custom gates are written by
constructing `EvidenceGate` directly with your own check function;
the [`claims_and_gates`](examples/claims_and_gates.md) example walks
through both reference and custom gates end-to-end.

See [methodology/claims.md](methodology/claims.md) for the full
contract — exception handling, severity policy, custom gates.
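
To make the "factory returns a frozen dataclass" idea concrete without relying on the toolkit's constructor signature, here is a toy version. `ToyGate` and `toy_threshold_gate` are hypothetical names for this sketch. The real `EvidenceGate` fields and check signature may differ, so consult the methodology chapter before writing a custom gate.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ToyGate:
    """Illustrative stand-in; the toolkit's EvidenceGate fields may differ."""

    name: str
    check: Callable[[dict], bool]
    severity: str = "error"


def toy_threshold_gate(slice_name, scorer, metric, threshold):
    """Factory that closes over its arguments and returns a gate."""

    def check(by_slice):
        value = by_slice[slice_name]["by_scorer"][scorer][metric]
        return value >= threshold

    return ToyGate(name=f"{slice_name}.{scorer}.{metric}>={threshold}", check=check)


by_slice = {"dev": {"by_scorer": {"m": {"pr_auc": 0.83}}}}  # made-up numbers
gates = [
    toy_threshold_gate("dev", "m", "pr_auc", 0.7),
    toy_threshold_gate("dev", "m", "pr_auc", 0.9),
]
outcomes = {gate.name: gate.check(by_slice) for gate in gates}
print(outcomes)
```

Freezing the dataclass keeps a gate's name, check, and severity fixed after the claim is registered, which fits the preregistration idea described above.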

(plot)=
## (Optional) Render a plot
```python
# Requires: pip install "eval-toolkit[plotting]"
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for headless / docs runs

import tempfile
from pathlib import Path

from eval_toolkit.plotting import plot_metric_bars, save_figure

# Synthetic per-scorer metric summary:
values = {"baseline": 0.65, "candidate_v1": 0.78, "candidate_v2": 0.82}
fig = plot_metric_bars(values, ylabel="PR-AUC", title="Dev slice")
out_path = Path(tempfile.gettempdir()) / "pr_auc_bars.png"
saved = save_figure(fig, out_path)
assert saved.exists()
```

The plotting module's API and visual conventions are documented in
each helper's docstring. See `eval_toolkit.plotting.__all__` for the
full list (`plot_pr_curve`, `plot_reliability_diagram`,
`plot_confusion_matrix_grid`, `plot_score_histograms`, `plot_lift_ci`,
`plot_bootstrap_distribution`).

(common-errors)=
## Common errors
A handful of mistakes come up most often when you're starting out:

(error-labels)=
### `ValueError: labels must be in {0, 1}`
Your DataFrame has labels other than `0` / `1`: strings, non-binary
integer codes, or `-1` sentinel values. eval-toolkit treats binary
classification as `{0, 1}` only.

Fix: convert before constructing the slice.

```python
import pandas as pd

raw = pd.DataFrame({"text": ["a", "b"], "label": ["pos", "neg"]})
raw["label"] = (raw["label"] == "pos").astype(int)
# Now raw["label"] is {0, 1}.
assert set(raw["label"]) <= {0, 1}
```

(error-strata)=
### `KeyError: missing strata column 'X'`
You passed `strata_col="X"` to `EvalSlice` but the DataFrame has no
column named `X`. Either remove the `strata_col=` argument or add the
column.

(error-wide-ci)=
### Bootstrap CIs are very wide
Either `n_resamples` is too low (the examples in this guide use 10 to
50 for docs speed; use **1000+** in real runs), or your slice has very
few positives or negatives. CI width depends on *both* the resampling
budget *and* the underlying sample size, so adding more resamples
won't help if you only have 5 positives.

Check the slice composition:

```python
import pandas as pd
from eval_toolkit import EvalSlice

df = pd.DataFrame({"text": ["a", "b", "c"], "label": [0, 0, 1]})
slc = EvalSlice(name="dev", df=df)
n_positive = int(slc.df["label"].sum())
n_negative = len(slc.df) - n_positive
assert n_positive >= 1 and n_negative >= 1  # else PR-AUC is undefined
```
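
The sample-size effect can be seen in a small simulation. The sketch below uses the mean of synthetic values as a stand-in metric and a plain percentile bootstrap, so it illustrates the general behaviour and is not the toolkit's CI method. `percentile_ci_width` is a hypothetical helper.

```python
import numpy as np


def percentile_ci_width(values, n_resamples, seed=0):
    """Width of a 95% percentile-bootstrap CI for the mean of `values`."""
    rng = np.random.default_rng(seed)
    n = len(values)
    means = np.array([
        values[rng.integers(0, n, size=n)].mean() for _ in range(n_resamples)
    ])
    low, high = np.percentile(means, [2.5, 97.5])
    return float(high - low)


rng = np.random.default_rng(0)
small = rng.normal(0.0, 1.0, size=20)
large = rng.normal(0.0, 1.0, size=2000)

for name, values in [("n=20", small), ("n=2000", large)]:
    for n_resamples in (50, 2000):
        width = percentile_ci_width(values, n_resamples)
        print(f"{name:<8} n_resamples={n_resamples:<5} width={width:.3f}")
```

Raising `n_resamples` mostly stabilises the interval endpoints from run to run. The large reduction in width comes from the bigger sample.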

(error-pr-curve)=
### `RuntimeError: PR curve has no thresholds`
Your `predict_proba` returned a constant value for every input. A
constant score yields only one candidate threshold, so PR / ROC curves
are degenerate. Fix: check that your model isn't outputting the same
score for every row.
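
A quick way to confirm this is to summarise the score vector before calling `evaluate(...)`. The helper below is a hypothetical sketch in plain NumPy, not a toolkit function.

```python
import numpy as np


def describe_scores(probs):
    """Summarise a score vector so degenerate outputs are easy to spot."""
    probs = np.asarray(probs, dtype=float)
    return {
        "n": int(probs.size),
        "n_distinct": int(np.unique(probs).size),
        "min": float(probs.min()),
        "max": float(probs.max()),
    }


summary = describe_scores([0.5, 0.5, 0.5, 0.5])
if summary["n_distinct"] < 2:
    print("scorer returned a constant; check input preprocessing and model loading")
```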

(error-pandas)=
### `'TYPE_CHECKING' import error` for pandas
The `dataframe` extra (`pip install "eval-toolkit[dataframe]"`)
installs pandas. pandas is a soft dependency, so `eval_toolkit` itself
still imports without it, but `import pandas` will fail in your
harness code and you won't be able to build the DataFrames that
`EvalSlice` expects. Install the extra if you're using DataFrames at
all (this guide assumes you are).

(where-to-go-next)=
## Where to go next
You now have a working `RunResult` and `results.json`. Recommended
next reading depending on what you're doing:

- **Building a real eval pipeline.** Read three methodology chapters
  in this order:
  1. [`leakage.md`](methodology/leakage.md) — making sure your eval
     data isn't contaminated by training data.
  2. [`splits.md`](methodology/splits.md) — choosing between holdout
     and K-fold, source-disjoint splitting.
  3. [`thresholds.md`](methodology/thresholds.md) — picking a decision
     threshold once your scorer ranks well.

- **Adding release-time claims.** Read
  [`methodology/claims.md`](methodology/claims.md) for the full gate
  contract and severity policy.

- **Replaying old evals.** Read
  [`methodology/artifacts.md`](methodology/artifacts.md) for the
  `PredictionArtifactRef` contract that lets you recompute metrics
  without re-running inference.

- **Writing a custom Scorer/Splitter/Gate.** Read
  [`extending.md`](extending.md).

- **Migrating from an older version.** Read
  [`MIGRATION.md`](MIGRATION.md).

- **Browsing the JSON Schemas.** Read
  [`schemas.md`](schemas.md) for the field-by-field reference, or run
  `eval-toolkit schemas list` from the CLI.

The [methodology curriculum index](methodology/README.md) covers 16
chapters total — read them in order if you want the full conceptual
map.