Getting Started#

A linear walkthrough from “I have a trained model” to “I have a results.json a stakeholder can read.” Aimed at Python-fluent readers new to eval-toolkit; no prior sklearn-eval experience assumed.

If you’re already comfortable with sklearn-style evaluation, you can skim the conceptual sections (What is…, The Scorer concept, The EvalSlice concept) and read the code blocks directly.

Table of contents#

What is an eval, and what does this toolkit do?
Install
The Scorer concept
The EvalSlice concept
Run evaluate() and read the output
Persist results
(Optional) Validate the JSON
(Optional) Add a claim
(Optional) Render a plot
Common errors
Where to go next

What is an eval, and what does this toolkit do?#

An evaluation is the process of turning a model’s predictions into calibrated metrics with uncertainty. The numbers (PR-AUC, ROC-AUC, precision-at-recall-X) are the surface. The calibration (does the score 0.8 actually mean 80% chance of positive?) and the uncertainty (is the +5 pp PR-AUC lift over baseline likely real or noise?) are the substance.
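
To make the calibration question concrete, here is a minimal sketch of a binned calibration check written with plain NumPy. It is not the toolkit's ECE implementation; the function name binned_calibration_gap, the ten equal-width bins, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def binned_calibration_gap(probs, labels, n_bins=10):
    """Weighted mean |observed positive rate - mean score| over equal-width bins.

    Illustrative helper (hypothetical name); not the toolkit's own metric.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Assign each score to one of n_bins equal-width bins over [0, 1].
    bin_ids = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    gap = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # empty bins contribute nothing
        weight = mask.mean()  # fraction of rows that fall in this bin
        gap += weight * abs(labels[mask].mean() - probs[mask].mean())
    return float(gap)


rng = np.random.default_rng(0)
scores = rng.uniform(0.0, 1.0, size=20_000)
# Calibrated case: each label is positive with probability equal to its score.
calibrated_labels = (rng.uniform(size=scores.size) < scores).astype(int)
# Overconfident case: the true positive rate is pulled halfway toward 0.5.
true_rate = 0.5 + 0.5 * (scores - 0.5)
overconfident_labels = (rng.uniform(size=scores.size) < true_rate).astype(int)

gap_calibrated = binned_calibration_gap(scores, calibrated_labels)
gap_overconfident = binned_calibration_gap(scores, overconfident_labels)
```

The same scores produce a small gap when a score of 0.8 really does mean roughly 80% positives, and a clearly larger gap when it does not, even though the ranking of rows is unchanged.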

This toolkit sits between two things you already have:

A model that produces probability scores. Could be sklearn, PyTorch, an API call to a hosted model, a regex — anything that takes inputs and returns P(positive).
Labeled data to evaluate it on. Rows with a binary label and a text (or feature) column.

What you get back:

Headline metrics (pr_auc, roc_auc, brier_score, …)
Bootstrap confidence intervals on those metrics
Per-slice breakdowns (dev vs test, by source, by strata)
Paired-difference CIs when comparing two models on the same rows
A reproducible manifest (git_sha, seed, GPU info, dataset hashes)
A results.json and manifest.json that downstream consumers can parse against a versioned JSON Schema

The toolkit does not ship report templates, dashboard renderers, or claim copy — those are domain-specific and belong in your consumer code.

Install#

pip install eval-toolkit

Or with optional extras:

pip install "eval-toolkit[dataframe,plotting,validation]"

Common extras:

dataframe — pandas. Required if you want to pass pd.DataFrame to EvalSlice (the easy path; this guide assumes it).
plotting — matplotlib + pillow. Required for the plot_* helpers.
validation — jsonschema. Required for validate_payload(...).
property — hypothesis. Only if you write property tests against the toolkit itself.
all — everything optional, the kitchen-sink install.

This guide uses dataframe and validation. Plotting is optional and covered in section 9, (Optional) Render a plot.

The Scorer concept#

A Scorer is anything that exposes a predict_proba(X) method returning one probability per input row, where probability ∈ [0, 1] and represents P(positive class).

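That’s the entire contract. It’s deliberately Protocol-based: you don’t subclass anything, you just implement the method. Your model class is probably close already: sklearn classifiers expose predict_proba but return one column per class, so they need the thin wrapper shown below, and transformers pipelines have no predict_proba at all but can be adapted with a similarly small wrapper.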
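
As a rough illustration of what "Protocol-based" means in practice, the sketch below defines a stand-in protocol with typing.Protocol and checks two unrelated classes against it. The name ScorerLike and its type hints are hypothetical; the toolkit's actual protocol definition is not shown in this guide and may differ.

```python
from typing import Protocol, Sequence, runtime_checkable


@runtime_checkable
class ScorerLike(Protocol):
    """Hypothetical stand-in for the toolkit's Scorer protocol."""

    def predict_proba(self, X: Sequence[str]) -> Sequence[float]:
        ...


class ConstantHalf:
    """Satisfies ScorerLike structurally: no inheritance, just the method."""

    def predict_proba(self, X: Sequence[str]) -> list[float]:
        return [0.5 for _ in X]


class NotAScorer:
    """Has a scoring method, but under the wrong name."""

    def score(self, X: Sequence[str]) -> list[float]:
        return [0.5 for _ in X]


# isinstance() against a runtime_checkable Protocol only checks that the
# method exists; it does not verify the signature or the returned values.
is_scorer = isinstance(ConstantHalf(), ScorerLike)
is_not_scorer = isinstance(NotAScorer(), ScorerLike)
```

Because the check is structural, any class with a predict_proba method qualifies, which is why the range and shape of the returned values still need their own assertions.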

Example: a minimal Scorer#

import numpy as np


class LengthScorer:
    """Scores longer texts higher. Useful only as a demo Scorer."""

    def predict_proba(self, X: list[str]) -> np.ndarray:
        # Map length to a [0, 1] score via a saturating function.
        lengths = np.array([len(x) for x in X], dtype=float)
        return lengths / (lengths + 10.0)


scorer = LengthScorer()
probs = scorer.predict_proba(["hi", "hello world"])
assert probs.shape == (2,)
assert (0.0 <= probs).all() and (probs <= 1.0).all()

That’s a fully valid Scorer. No registration, no base class.

If you have an sklearn pipeline:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# sklearn pipelines already implement predict_proba(X) → (n, n_classes).
# Wrap to return only the positive-class column.

class SklearnBinaryScorer:
    def __init__(self, pipe):
        self.pipe = pipe

    def predict_proba(self, X) -> np.ndarray:
        return self.pipe.predict_proba(X)[:, 1]


pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=200)),
])
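# Fit pipe on your training data, then wrap it (train_texts and train_labels
# are placeholders for your own data):
# pipe.fit(train_texts, train_labels)
# scorer = SklearnBinaryScorer(pipe)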

If your model is async / behind an API: cache the responses upfront, then have predict_proba look up the cached scores. The toolkit doesn’t care.
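
One way to sketch the cache-then-look-up pattern is shown below. CachedScorer and fake_remote_score are hypothetical names, and the stand-in scoring function simply replaces whatever slow or asynchronous client you actually call.

```python
import numpy as np


class CachedScorer:
    """Looks up precomputed scores by input text (hypothetical helper)."""

    def __init__(self, cache: dict[str, float]):
        self.cache = cache

    def predict_proba(self, X) -> np.ndarray:
        missing = [x for x in X if x not in self.cache]
        if missing:
            # Fail loudly rather than silently scoring uncached rows.
            raise KeyError(
                f"{len(missing)} inputs have no cached score, e.g. {missing[0]!r}"
            )
        return np.array([self.cache[x] for x in X], dtype=float)


def fake_remote_score(text: str) -> float:
    """Stand-in for a slow API call; replace with your real client."""
    return min(len(text) / 20.0, 1.0)


texts = ["hi", "hello world", "a much longer input string"]
# Step 1: call the slow model once per unique input, outside the eval loop.
cache = {t: fake_remote_score(t) for t in set(texts)}
# Step 2: evaluation only ever touches the cache.
scorer = CachedScorer(cache)
probs = scorer.predict_proba(texts)
```

Keying the cache on the raw input keeps bootstrap resampling cheap, since resampled rows repeat inputs that are already scored. If your inputs are not hashable, key on a row identifier instead.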

The EvalSlice concept#

An EvalSlice is the unit of evaluation: a named, labeled subset of data that you want metrics computed on. You typically have several:

dev and test (the standard split)
by_source (predictions on different data sources)
by_strata (predictions on different label-balanced strata)
OOD slices, regression slices, stress-test slices, etc.

Each slice is constructed from a pandas DataFrame with at minimum a text and label column. label must be {0, 1}.

Example: building two slices#

import numpy as np
import pandas as pd
from eval_toolkit import EvalSlice

# Synthetic dev set: 100 rows, roughly balanced classes (labels are random draws).
rng = np.random.default_rng(42)
n = 100
labels = rng.integers(0, 2, size=n)
# Texts whose length correlates with the label.
texts = [
    "x" * (3 + int(label) * 8 + int(rng.integers(0, 4)))
    for label in labels
]
dev_df = pd.DataFrame({"text": texts, "label": labels})

dev_slice = EvalSlice(name="dev", df=dev_df)
assert dev_slice.name == "dev"
assert len(dev_slice.df) == 100

The constructor validates the shape: text and label columns must exist, labels must be in {0, 1} (other label encodings raise a ValueError), and the DataFrame must be non-empty.
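
If you want to see which rule a frame breaks before constructing a slice, a pre-flight check along the following lines may help. This is a sketch of the rules listed above, not the constructor's actual code: check_slice_frame is a hypothetical helper, and raising ValueError for every rule is an assumption (the guide only states it for label encodings).

```python
import pandas as pd


def check_slice_frame(df: pd.DataFrame) -> None:
    """Raise ValueError if df breaks the documented slice rules (hypothetical helper)."""
    missing = {"text", "label"} - set(df.columns)
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    if len(df) == 0:
        raise ValueError("DataFrame must be non-empty")
    bad = set(df["label"].unique()) - {0, 1}
    if bad:
        raise ValueError(f"labels must be in {{0, 1}}, found: {sorted(bad, key=str)}")


def first_problem(df: pd.DataFrame):
    """Return the error message, or None if the frame passes every check."""
    try:
        check_slice_frame(df)
    except ValueError as exc:
        return str(exc)
    return None


ok = pd.DataFrame({"text": ["a", "b"], "label": [0, 1]})
bad_labels = pd.DataFrame({"text": ["a", "b"], "label": [1, 2]})
no_text = pd.DataFrame({"label": [0, 1]})

problem_ok = first_problem(ok)
problem_labels = first_problem(bad_labels)
problem_columns = first_problem(no_text)
```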

If you have multiple sources to evaluate per-source:

import pandas as pd
from eval_toolkit import EvalSlice

# Tag each row with its source, then build one slice per source.
df = pd.DataFrame({
    "text": ["a", "b", "c", "d", "e", "f"],
    "label": [0, 1, 0, 1, 0, 1],
    "source": ["A", "A", "B", "B", "C", "C"],
})

slices = [
    EvalSlice(name=f"source_{src}", df=sub.reset_index(drop=True))
    for src, sub in df.groupby("source")
]
assert len(slices) == 3

Run `evaluate()` and read the output#

evaluate(...) is the orchestrator. Given a mapping of scorers and a list of slices, it computes the full headline-metric battery per (slice, scorer) pair, runs bootstrap CIs, and returns a RunResult.

import numpy as np
import pandas as pd
from eval_toolkit import EvalSlice, evaluate


class LengthScorer:
    def predict_proba(self, X):
        lengths = np.array([len(x) for x in X], dtype=float)
        return lengths / (lengths + 10.0)


rng = np.random.default_rng(0)
n = 100
labels = rng.integers(0, 2, size=n)
texts = ["x" * (3 + int(label) * 8) for label in labels]
df = pd.DataFrame({"text": texts, "label": labels})
dev_slice = EvalSlice(name="dev", df=df)

result = evaluate(
    {"length": LengthScorer()},
    [dev_slice],
    run_id="demo-run",
    n_resamples=50,  # small for the doctest; use 1000+ in real runs
    seed=42,
)

assert result.run_id == "demo-run"
assert "dev" in result.by_slice

Reading the output#

result.by_slice is a nested dict:

by_slice
├── "dev"
│   ├── "n"          : 100
│   ├── "n_positive" : ~50 (depends on RNG)
│   ├── "by_scorer"
│   │   └── "length"
│   │       ├── "pr_auc"           : float in [0, 1]
│   │       ├── "roc_auc"          : float in [0, 1]
│   │       ├── "pr_auc_ci"        : BootstrapCI dict
│   │       │   ├── "point_estimate" : float
│   │       │   ├── "ci_95"          : [low, high]   (or "skipped" if n<30)
│   │       │   ├── "confidence"     : 0.95
│   │       │   ├── "n_resamples"    : 50
│   │       │   └── "method"         : "BCa" | "percentile"
│   │       ├── "ece"              : float (expected calibration error)
│   │       └── ...                 (other metrics, plus operating_points)
│   └── "paired_diffs" : {}  (empty unless paired_diffs= explicitly set)

Access a metric:

import numpy as np
import pandas as pd
from eval_toolkit import EvalSlice, evaluate


class _Scorer:
    def predict_proba(self, X):
        return np.array([len(x) / (len(x) + 10) for x in X])


# Bootstrap CIs require n >= 30, so use a slice bigger than a toy 3-row frame.
rng = np.random.default_rng(0)
n = 40
labels = rng.integers(0, 2, size=n)
texts = ["x" * (3 + int(label) * 8) for label in labels]
df = pd.DataFrame({"text": texts, "label": labels})
result = evaluate({"m": _Scorer()}, [EvalSlice(name="dev", df=df)], run_id="r", n_resamples=20)

pr_auc = result.by_slice["dev"]["by_scorer"]["m"]["pr_auc"]
ci = result.by_slice["dev"]["by_scorer"]["m"]["pr_auc_ci"]
assert 0.0 <= pr_auc <= 1.0
# ci is a BootstrapCI dict with point_estimate + ci_95 [low, high]
assert "ci_95" in ci or ci.get("status") == "skipped"
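
To survey one metric across every (slice, scorer) pair rather than indexing a single path, a small flattening helper can walk the nested layout. The sketch below runs on a hand-written dict with made-up numbers that mirrors the tree in Reading the output; flatten_metric is a hypothetical helper, not part of the toolkit.

```python
def flatten_metric(by_slice: dict, metric: str) -> list[tuple[str, str, float]]:
    """Collect (slice, scorer, value) rows for one metric from a by_slice-shaped dict."""
    rows = []
    for slice_name, slice_block in by_slice.items():
        for scorer_name, metrics in slice_block.get("by_scorer", {}).items():
            if metric in metrics:
                rows.append((slice_name, scorer_name, metrics[metric]))
    return rows


# Hand-written stand-in with made-up numbers, shaped like result.by_slice.
example_by_slice = {
    "dev": {
        "n": 100,
        "by_scorer": {"length": {"pr_auc": 0.91}, "tfidf": {"pr_auc": 0.95}},
    },
    "test": {"n": 80, "by_scorer": {"length": {"pr_auc": 0.88}}},
}
rows = flatten_metric(example_by_slice, "pr_auc")
```

With a real run you would pass result.by_slice in place of example_by_slice; the rows can then go straight into a DataFrame or a printed table.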

Comparing two scorers#

When you want a paired-difference CI between two scorers on the same rows, pass paired_diffs=[(baseline, candidate)] to evaluate(...):

import numpy as np
import pandas as pd
from eval_toolkit import EvalSlice, evaluate


class A:
    def predict_proba(self, X):
        return np.array([0.3 + 0.4 * (i % 2) for i in range(len(X))])


class B:
    def predict_proba(self, X):
        return np.array([0.4 + 0.5 * (i % 2) for i in range(len(X))])


df = pd.DataFrame({"text": ["x"] * 40, "label": [0, 1] * 20})
result = evaluate(
    {"a": A(), "b": B()},
    [EvalSlice(name="dev", df=df)],
    run_id="r",
    n_resamples=20,
    paired_diffs=[("a", "b")],  # explicit baseline → candidate pair
)
diffs = result.by_slice["dev"]["paired_diffs"]
assert ("a", "b") in diffs or "a__minus__b" in diffs or len(diffs) >= 1
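
The idea behind a paired-difference CI can be sketched with plain NumPy: resample row indices once per iteration and score both models on that same resample, so row-level noise shared by the two scorers cancels out. This is an illustration under stated assumptions, not the toolkit's implementation: it uses a percentile interval, the Brier score as the metric, and a candidate-minus-baseline sign convention, none of which this guide confirms for evaluate(...).

```python
import numpy as np


def brier(labels, probs):
    """Mean squared error between scores and labels (lower is better)."""
    return float(np.mean((probs - labels) ** 2))


def paired_diff_ci(metric, labels, probs_a, probs_b, n_resamples=1000, seed=0):
    """95% percentile CI for metric(b) - metric(a), resampling the same rows for both."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    diffs = np.empty(n_resamples)
    for i in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # shared row indices keep the pairing
        diffs[i] = metric(labels[idx], probs_b[idx]) - metric(labels[idx], probs_a[idx])
    low, high = np.quantile(diffs, [0.025, 0.975])
    return float(low), float(high)


rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=400)
noise = rng.normal(0.0, 0.1, size=400)
# Baseline a: weakly informative. Candidate b: same noise, stronger signal.
probs_a = np.clip(0.5 + 0.1 * (2 * labels - 1) + noise, 0.0, 1.0)
probs_b = np.clip(0.5 + 0.3 * (2 * labels - 1) + noise, 0.0, 1.0)

low, high = paired_diff_ci(brier, labels, probs_a, probs_b)
```

An interval that excludes zero suggests the difference is unlikely to be resampling noise. Here the whole interval sits below zero because the candidate has the lower Brier score.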

Persist results#

RunResult.to_dict() produces a strict-JSON-safe payload:

import json
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

from eval_toolkit import EvalSlice, evaluate
from eval_toolkit.artifacts import write_json_strict


class _S:
    def predict_proba(self, X):
        return np.linspace(0.1, 0.9, len(X))


df = pd.DataFrame({"text": [f"row_{i}" for i in range(10)], "label": [0, 1] * 5})
result = evaluate({"m": _S()}, [EvalSlice(name="dev", df=df)], run_id="demo", n_resamples=10)

out_path = Path(tempfile.gettempdir()) / "demo_results.json"
write_json_strict(result.to_dict(), out_path)

# What the on-disk JSON looks like:
data = json.loads(out_path.read_text())
assert data["run_id"] == "demo"
assert "schema_version" in data

write_json_strict uses allow_nan=False and runs the payload through sanitize_for_json first — NaN / Inf becomes a structured skipped_metric(...) payload rather than producing invalid JSON.
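
The standard-library behavior this relies on can be seen in isolation. The sketch below shows json.dumps rejecting NaN under allow_nan=False, then a recursive pass that swaps non-finite floats for a placeholder before serializing. The replace_non_finite helper and the placeholder's keys are assumptions for illustration; the actual output of sanitize_for_json and skipped_metric(...) is not shown in this guide.

```python
import json
import math


def replace_non_finite(value):
    """Recursively swap NaN/Inf floats for a placeholder dict (hypothetical shape)."""
    if isinstance(value, float) and not math.isfinite(value):
        return {"status": "skipped", "reason": "non_finite_value"}
    if isinstance(value, dict):
        return {k: replace_non_finite(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [replace_non_finite(v) for v in value]
    return value


payload = {"run_id": "demo", "metrics": {"pr_auc": 0.8, "ece": float("nan")}}

# Strict mode refuses NaN outright instead of emitting the non-standard token.
try:
    json.dumps(payload, allow_nan=False)
    strict_failed = False
except ValueError:
    strict_failed = True

# After the cleaning pass, the same strict call succeeds and round-trips.
clean_text = json.dumps(replace_non_finite(payload), allow_nan=False)
round_tripped = json.loads(clean_text)
```

Without allow_nan=False, Python writes a bare NaN token that many JSON parsers in other languages reject, which is the failure the strict writer is there to prevent.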

(Optional) Validate the JSON#

Validate against the bundled JSON Schema to catch shape regressions between your harness and consumer parsers:

# Requires: pip install "eval-toolkit[validation]"
import numpy as np
import pandas as pd

from eval_toolkit import EvalSlice, evaluate
from eval_toolkit.artifacts import validate_payload


class _S:
    def predict_proba(self, X):
        return np.linspace(0.1, 0.9, len(X))


df = pd.DataFrame({"text": [f"r{i}" for i in range(10)], "label": [0, 1] * 5})
result = evaluate({"m": _S()}, [EvalSlice(name="dev", df=df)], run_id="demo", n_resamples=10)

# This is a no-op on success; raises jsonschema.ValidationError on a bad shape.
validate_payload(result.to_dict(), schema_name="results.v1.json")

You can also validate from the CLI without writing Python:

eval-toolkit validate run_dir/results.json results.v1

(See docs/schemas.md for the field-by-field reference.)

(Optional) Add a claim#

A claim is a release-time go/no-go assertion: “PR-AUC is supported on the dev slice, which has at least 100 rows including at least 20 positives and 20 negatives, and the metric value is at least 0.7.” Claims are not exploratory metrics; they are preregistered preconditions that the renderer reads to decide whether to print “we claim X” or “we cannot claim X.”

import numpy as np
import pandas as pd

from eval_toolkit import EvalSlice, evaluate
from eval_toolkit.claims import (
    ClaimSpec,
    evaluate_claims,
    metric_threshold_gate,
    minimum_slice_size_gate,
    required_metric_gate,
)
from eval_toolkit.harness import with_claim_report


class _S:
    def predict_proba(self, X):
        # Score = 0.9 if label-marker, else 0.1
        return np.array([0.9 if "P" in x else 0.1 for x in X])


df = pd.DataFrame({
    "text": ["P_a", "P_b", "N_a", "N_b"] * 50,
    "label": [1, 1, 0, 0] * 50,
})
result = evaluate({"m": _S()}, [EvalSlice(name="dev", df=df)], run_id="demo", n_resamples=20)

claim = ClaimSpec(
    name="dev_pr_auc_supported",
    gates=(
        required_metric_gate("dev", "m", "pr_auc"),
        minimum_slice_size_gate("dev", min_n=100, min_positive=20, min_negative=20),
        metric_threshold_gate("dev", "m", "pr_auc", op=">=", threshold=0.7),
    ),
)

report = evaluate_claims(result, [claim])
assert report.has_failures() is False

# Attach the claim report to the RunResult for the renderer to read:
result_with_claim = with_claim_report(result, report)
assert result_with_claim.claim_report is not None

Each of the three gate calls above (required_metric_gate, minimum_slice_size_gate, metric_threshold_gate) is a factory that returns an EvidenceGate instance — a frozen dataclass bundling a callable check, a name, and a severity. Custom gates are written by constructing EvidenceGate directly with your own check function; the claims_and_gates example walks through both reference and custom gates end-to-end.
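
To picture the factory-returns-a-frozen-dataclass pattern without depending on the toolkit, consider the stand-in below. Everything in it is hypothetical: GateSketch is not EvidenceGate, and the field names, the check signature (a by_slice-shaped dict in, a bool out), and the "error" severity value are assumptions, since the real definitions are not shown in this guide.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GateSketch:
    """Hypothetical stand-in for EvidenceGate: a named check with a severity."""

    name: str
    check: Callable[[dict], bool]  # receives a by_slice-shaped dict, returns pass/fail
    severity: str = "error"


def min_rows_gate(slice_name: str, min_n: int) -> GateSketch:
    """Factory in the style described above: returns a configured gate."""

    def check(by_slice: dict) -> bool:
        # A missing slice counts as zero rows, so the gate fails closed.
        return by_slice.get(slice_name, {}).get("n", 0) >= min_n

    return GateSketch(name=f"{slice_name}_min_rows_{min_n}", check=check)


gate = min_rows_gate("dev", min_n=100)
passes_big = gate.check({"dev": {"n": 200}})
passes_small = gate.check({"dev": {"n": 40}})
passes_missing = gate.check({})
```

Freezing the dataclass means a gate cannot be mutated after it is preregistered, and the closure keeps each gate's parameters bound to the instance the factory returned.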

See methodology/claims.md for the full contract — exception handling, severity policy, custom gates.

(Optional) Render a plot#

# Requires: pip install "eval-toolkit[plotting]"
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for headless / docs runs

import tempfile
from pathlib import Path

from eval_toolkit.plotting import plot_metric_bars, save_figure

# Synthetic per-scorer metric summary:
values = {"baseline": 0.65, "candidate_v1": 0.78, "candidate_v2": 0.82}
fig = plot_metric_bars(values, ylabel="PR-AUC", title="Dev slice")
out_path = Path(tempfile.gettempdir()) / "pr_auc_bars.png"
saved = save_figure(fig, out_path)
assert saved.exists()

The plotting module’s API and visual conventions are documented in each helper’s docstring. See eval_toolkit.plotting.__all__ for the full list of helpers, which also includes plot_pr_curve, plot_reliability_diagram, plot_confusion_matrix_grid, plot_score_histograms, plot_lift_ci, and plot_bootstrap_distribution.

Common errors#

A few mistakes come up more often than others when you’re starting out:

`ValueError: labels must be in {0, 1}`#

Your DataFrame has labels other than 0 / 1: strings such as “pos” / “neg”, a {1, 2} or {-1, 1} encoding, or -1 sentinel values. eval-toolkit treats binary classification as {0, 1} only.

Fix: convert before constructing the slice.

import pandas as pd

raw = pd.DataFrame({"text": ["a", "b"], "label": ["pos", "neg"]})
raw["label"] = (raw["label"] == "pos").astype(int)
# Now raw["label"] is {0, 1}.
assert set(raw["label"]) <= {0, 1}

`KeyError: missing strata column 'X'`#

You passed strata_col="X" to EvalSlice but the DataFrame has no column named X. Either remove the strata_col= argument or add the column.

Bootstrap CIs are very wide#

Either n_resamples is too low (the examples in this guide use 10 to 50 to keep the docs fast; use 1000+ in real runs), or your slice has very few positives or negatives. A small resampling budget makes the interval endpoints noisy, while a small sample makes the interval genuinely wide, so adding more resamples won’t help if you only have 5 positives.

Check the slice composition:

import pandas as pd
from eval_toolkit import EvalSlice

df = pd.DataFrame({"text": ["a", "b", "c"], "label": [0, 0, 1]})
slc = EvalSlice(name="dev", df=df)
n_positive = int(slc.df["label"].sum())
n_negative = len(slc.df) - n_positive
assert n_positive >= 1 and n_negative >= 1  # else PR-AUC is undefined
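
A quick simulation can show why sample size, not the resampling budget, dominates the interval width. The sketch below bootstraps the positive rate of a synthetic slice as a simple stand-in for PR-AUC. The ci_width helper, the percentile interval, and the chosen sizes are assumptions for illustration, not the toolkit's procedure.

```python
import numpy as np


def ci_width(n_rows, n_resamples, seed):
    """Width of a 95% percentile bootstrap CI for the positive rate of a synthetic slice."""
    rng = np.random.default_rng(seed)
    labels = (rng.uniform(size=n_rows) < 0.5).astype(float)
    means = np.array([
        labels[rng.integers(0, n_rows, size=n_rows)].mean()
        for _ in range(n_resamples)
    ])
    low, high = np.quantile(means, [0.025, 0.975])
    return float(high - low)


# Same resampling budget, very different sample sizes.
width_small_slice = ci_width(n_rows=20, n_resamples=1000, seed=0)
width_large_slice = ci_width(n_rows=2000, n_resamples=1000, seed=0)

# Same small slice, 20x the resampling budget: the interval stays wide.
width_small_more_resamples = ci_width(n_rows=20, n_resamples=20_000, seed=0)
```

Going from 20 to 2000 rows shrinks the interval by roughly a factor of ten, while multiplying the resamples on the 20-row slice leaves it about as wide as before.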

`RuntimeError: PR curve has no thresholds`#

Your predict_proba returned a constant value for every input. PR / ROC curves are undefined for a single threshold. Fix: check that your model isn’t outputting the same score for every row.
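
A pre-flight check along these lines can catch a constant scorer before you spend time on a full run. describe_score_spread is a hypothetical helper; in practice you would pass it the output of your own predict_proba on a sample of rows.

```python
import numpy as np


def describe_score_spread(probs):
    """Summarize how varied a score vector is (hypothetical pre-flight helper)."""
    probs = np.asarray(probs, dtype=float)
    return {
        "n_unique": int(np.unique(probs).size),
        "min": float(probs.min()),
        "max": float(probs.max()),
        # A zero range means every row received the same score.
        "is_constant": bool(probs.max() - probs.min() == 0.0),
    }


constant = describe_score_spread([0.5, 0.5, 0.5, 0.5])
varied = describe_score_spread([0.1, 0.4, 0.4, 0.9])
```

A very low n_unique on a large slice is also worth a look even when the scores are not strictly constant, since heavy ties leave few usable thresholds.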

`'TYPE_CHECKING' import error` for pandas#

The dataframe extra (pip install "eval-toolkit[dataframe]") installs pandas. pandas is a soft dependency, so eval_toolkit itself still imports without it, but import pandas will fail in your harness code and you won’t be able to build the DataFrames that EvalSlice expects. Install the extra if you’re using DataFrames at all (this guide assumes you are).

Where to go next#

You now have a working RunResult and results.json. Recommended next reading depending on what you’re doing:

Building a real eval pipeline. Read three methodology chapters in this order:
1. leakage.md — making sure your eval data isn’t contaminated by training data.
2. splits.md — choosing between holdout and K-fold, source-disjoint splitting.
3. thresholds.md — picking a decision threshold once your scorer ranks well.
Adding release-time claims. Read methodology/claims.md for the full gate contract and severity policy.
Replaying old evals. Read methodology/artifacts.md for the PredictionArtifactRef contract that lets you recompute metrics without re-running inference.
Writing a custom Scorer/Splitter/Gate. Read extending.md.
Migrating from an older version. Read MIGRATION.md.
Browsing the JSON Schemas. Read schemas.md for the field-by-field reference, or run eval-toolkit schemas list from the CLI.

The methodology curriculum index covers 16 chapters total — read them in order if you want the full conceptual map.