Getting Started#
A linear walkthrough from “I have a trained model” to “I have a
results.json a stakeholder can read.” Aimed at Python-fluent readers
new to eval-toolkit; no prior sklearn-eval experience assumed.
If you’re already comfortable with sklearn-style evaluation, you can skim the conceptual sections (marked What is…) and read the code blocks directly.
What is an eval, and what does this toolkit do?#
An evaluation is the process of turning a model’s predictions into calibrated metrics with uncertainty. The numbers (PR-AUC, ROC-AUC, precision-at-recall-X) are the surface. The calibration (does the score 0.8 actually mean 80% chance of positive?) and the uncertainty (is the +5 pp PR-AUC lift over baseline likely real or noise?) are the substance.
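To make the calibration question concrete, here is a minimal, toolkit-independent sketch. It bins scores and compares each bin's mean score with the observed positive rate. The names `reliability_table` and `calibration_gap`, the equal-width binning, and the toy data are all illustrative assumptions; the toolkit's own calibration metrics may bin differently.

```python
def reliability_table(scores, labels, n_bins=5):
    """Return (count, mean_score, positive_rate) for each non-empty score bin."""
    bins = [[] for _ in range(n_bins)]
    for score, label in zip(scores, labels):
        # Equal-width bins over [0, 1]; a score of exactly 1.0 goes in the last bin.
        index = min(int(score * n_bins), n_bins - 1)
        bins[index].append((score, label))
    rows = []
    for members in bins:
        if members:
            count = len(members)
            mean_score = sum(s for s, _ in members) / count
            positive_rate = sum(y for _, y in members) / count
            rows.append((count, mean_score, positive_rate))
    return rows


def calibration_gap(scores, labels, n_bins=5):
    """Size-weighted average of |mean score - positive rate| across bins."""
    total = len(scores)
    return sum(
        count / total * abs(mean_score - positive_rate)
        for count, mean_score, positive_rate in reliability_table(scores, labels, n_bins)
    )


# Toy data: 20 rows scored 0.75 and 20 rows scored 0.25.
scores = [0.75] * 20 + [0.25] * 20
# Calibrated: 75% of the 0.75 rows and 25% of the 0.25 rows are positive.
calibrated_labels = [1] * 15 + [0] * 5 + [1] * 5 + [0] * 15
# Overconfident: only 50% of the 0.75 rows are positive.
overconfident_labels = [1] * 10 + [0] * 10 + [1] * 5 + [0] * 15

gap_calibrated = calibration_gap(scores, calibrated_labels)
gap_overconfident = calibration_gap(scores, overconfident_labels)
```

Both label sets could give a scorer a respectable ranking metric; only the gap tells you whether a score of 0.75 can be read as "75% chance of positive."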
This toolkit sits between two things you already have:
- A model that produces probability scores. Could be sklearn, PyTorch, an API call to a hosted model, a regex: anything that takes inputs and returns P(positive).
- Labeled data to evaluate it on. Rows with a binary label and a text (or feature) column.
What you get back:
- Headline metrics (pr_auc, roc_auc, brier_score, …)
- Bootstrap confidence intervals on those metrics
- Per-slice breakdowns (dev vs test, by source, by strata)
- Paired-difference CIs when comparing two models on the same rows
- A reproducible manifest (git_sha, seed, GPU info, dataset hashes)
- A results.json and manifest.json that downstream consumers can parse against a versioned JSON Schema
The toolkit does not ship report templates, dashboard renderers, or claim copy — those are domain-specific and belong in your consumer code.
Install#
pip install eval-toolkit
Or with optional extras:
pip install "eval-toolkit[dataframe,plotting,validation]"
Common extras:
- dataframe: pandas. Required if you want to pass pd.DataFrame to EvalSlice (the easy path; this guide assumes it).
- plotting: matplotlib + pillow. Required for the plot_* helpers.
- validation: jsonschema. Required for validate_payload(...).
- property: hypothesis. Only if you write property tests against the toolkit itself.
- all: everything optional, the kitchen-sink install.
This guide uses dataframe and validation. plotting is optional; it is only needed for the “(Optional) Render a plot” section.
The Scorer concept#
A Scorer is anything that exposes a predict_proba(X) method
returning one probability per input row, where probability ∈ [0, 1]
and represents P(positive class).
That’s the entire contract. It’s deliberately Protocol-based: you
don’t subclass anything, you just implement the method. Your model
class probably already does this (sklearn estimators do;
transformers pipelines do not natively, but a one-line wrapper fixes that).
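If structural typing is new to you, the sketch below shows the idea with the standard library only. `ScorerLike` is a hypothetical stand-in written for this example, not the toolkit's actual protocol class, and the other class names are made up too.

```python
from collections.abc import Sequence
from typing import Protocol, runtime_checkable


@runtime_checkable
class ScorerLike(Protocol):
    """Hypothetical stand-in for a Scorer protocol (not the toolkit's class)."""

    def predict_proba(self, X: Sequence[str]) -> Sequence[float]:
        ...


class ConstantScorer:
    """Does not inherit from ScorerLike; it only implements the method."""

    def predict_proba(self, X):
        return [0.5 for _ in X]


class NotAScorer:
    """Has a predict method, but no predict_proba."""

    def predict(self, X):
        return [0 for _ in X]


# A runtime_checkable isinstance check only confirms that the method exists.
# It does not verify the signature or that the outputs lie in [0, 1].
is_scorer = isinstance(ConstantScorer(), ScorerLike)
is_not_scorer = isinstance(NotAScorer(), ScorerLike)
```

The practical upshot is that your model class never has to import anything from the toolkit to qualify.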
Example: a minimal Scorer#
import numpy as np


class LengthScorer:
    """Scores longer texts higher. Useful only as a demo Scorer."""

    def predict_proba(self, X: list[str]) -> np.ndarray:
        # Map length to a [0, 1] score via a saturating function.
        lengths = np.array([len(x) for x in X], dtype=float)
        return lengths / (lengths + 10.0)


scorer = LengthScorer()
probs = scorer.predict_proba(["hi", "hello world"])
assert probs.shape == (2,)
assert (0.0 <= probs).all() and (probs <= 1.0).all()
That’s a fully valid Scorer. No registration, no base class.
If you have an sklearn pipeline:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


# sklearn pipelines already implement predict_proba(X) → (n, n_classes).
# Wrap to return only the positive-class column.
class SklearnBinaryScorer:
    def __init__(self, pipe):
        self.pipe = pipe

    def predict_proba(self, X) -> np.ndarray:
        # Column 1 holds P(class 1) for a binary classifier with classes [0, 1].
        return self.pipe.predict_proba(X)[:, 1]


pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=200)),
])
# (you'd fit pipe on training data here)
If your model is async / behind an API: cache the responses upfront,
then have predict_proba look up the cached scores. The toolkit
doesn’t care.
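As a rough sketch of that caching pattern: collect the scores once, keyed by input text, then serve them from a dict. `CachedScorer` and the cache contents are hypothetical. The sketch returns a plain list to stay dependency-free; wrapping the result in `np.asarray(...)` would match the other examples in this guide.

```python
class CachedScorer:
    """Serves precomputed scores; never calls the remote model itself."""

    def __init__(self, cache):
        # cache maps input text -> P(positive), filled by an earlier API run.
        self._cache = dict(cache)

    def predict_proba(self, X):
        missing = [x for x in X if x not in self._cache]
        if missing:
            # Fail loudly instead of silently scoring uncached rows.
            raise KeyError(
                f"{len(missing)} input(s) have no cached score, e.g. {missing[0]!r}"
            )
        # Preserve input order: one probability per row.
        return [self._cache[x] for x in X]


# Made-up cache standing in for responses fetched ahead of time.
cache = {"hello": 0.12, "buy now!!!": 0.91}
scorer = CachedScorer(cache)
probs = scorer.predict_proba(["buy now!!!", "hello"])
```

Raising on a cache miss is a design choice: a default score for unseen rows would quietly distort every metric computed downstream.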
The EvalSlice concept#
An EvalSlice is the unit of evaluation: a named, labeled
subset of data that you want metrics computed on. You typically have
several:
- dev and test (the standard split)
- by_source (predictions on different data sources)
- by_strata (predictions on different label-balanced strata)
- OOD slices, regression slices, stress-test slices, etc.
Each slice is constructed from a pandas DataFrame with at minimum a
text column and a label column. Every label value must be 0 or 1.
Example: building two slices#
import numpy as np
import pandas as pd
from eval_toolkit import EvalSlice

# Synthetic dev set: 100 rows, roughly balanced classes (labels are random draws).
rng = np.random.default_rng(42)
n = 100
labels = rng.integers(0, 2, size=n)
# Texts whose length correlates with the label.
texts = [
    "x" * (3 + int(label) * 8 + int(rng.integers(0, 4)))
    for label in labels
]
dev_df = pd.DataFrame({"text": texts, "label": labels})
dev_slice = EvalSlice(name="dev", df=dev_df)
assert dev_slice.name == "dev"
assert len(dev_slice.df) == 100
The constructor validates the shape: text and label columns must
exist, labels must be in {0, 1} (other label encodings raise a
ValueError), and the DataFrame must be non-empty.
If you have multiple sources to evaluate per-source:
import pandas as pd
from eval_toolkit import EvalSlice

# Tag each row with its source, then build one slice per source.
df = pd.DataFrame({
    "text": ["a", "b", "c", "d", "e", "f"],
    "label": [0, 1, 0, 1, 0, 1],
    "source": ["A", "A", "B", "B", "C", "C"],
})
slices = [
    EvalSlice(name=f"source_{src}", df=sub.reset_index(drop=True))
    for src, sub in df.groupby("source")
]
assert len(slices) == 3
Run evaluate() and read the output#
evaluate(...) is the orchestrator. Given a mapping of scorers and a
list of slices, it computes the full headline-metric battery per
(slice, scorer) pair, runs bootstrap CIs, and returns a RunResult.
import numpy as np
import pandas as pd
from eval_toolkit import EvalSlice, evaluate


class LengthScorer:
    def predict_proba(self, X):
        lengths = np.array([len(x) for x in X], dtype=float)
        return lengths / (lengths + 10.0)


rng = np.random.default_rng(0)
n = 100
labels = rng.integers(0, 2, size=n)
texts = ["x" * (3 + int(label) * 8) for label in labels]
df = pd.DataFrame({"text": texts, "label": labels})
dev_slice = EvalSlice(name="dev", df=df)
result = evaluate(
    {"length": LengthScorer()},
    [dev_slice],
    run_id="demo-run",
    n_resamples=50,  # small for the doctest; use 1000+ in real runs
    seed=42,
)
assert result.run_id == "demo-run"
assert "dev" in result.by_slice
Reading the output#
result.by_slice is a nested dict:
by_slice
├── "dev"
│   ├── "n"            : 100
│   ├── "n_positive"   : ~50 (depends on RNG)
│   ├── "by_scorer"
│   │   └── "length"
│   │       ├── "pr_auc"    : float in [0, 1]
│   │       ├── "roc_auc"   : float in [0, 1]
│   │       ├── "pr_auc_ci" : BootstrapCI dict
│   │       │   ├── "point_estimate" : float
│   │       │   ├── "ci_95"          : [low, high] (or "skipped" if n<30)
│   │       │   ├── "confidence"     : 0.95
│   │       │   ├── "n_resamples"    : 50
│   │       │   └── "method"         : "BCa" | "percentile"
│   │       ├── "ece"       : float (expected calibration error)
│   │       └── ... (other metrics, plus operating_points)
│   └── "paired_diffs" : {} (empty unless paired_diffs= explicitly set)
Access a metric:
import numpy as np
import pandas as pd
from eval_toolkit import EvalSlice, evaluate


class _Scorer:
    def predict_proba(self, X):
        return np.array([len(x) / (len(x) + 10) for x in X])


# Bootstrap CIs require n >= 30, so use a slice larger than a toy 3-row example.
rng = np.random.default_rng(0)
n = 40
labels = rng.integers(0, 2, size=n)
texts = ["x" * (3 + int(label) * 8) for label in labels]
df = pd.DataFrame({"text": texts, "label": labels})
result = evaluate(
    {"m": _Scorer()},
    [EvalSlice(name="dev", df=df)],
    run_id="r",
    n_resamples=20,
)
pr_auc = result.by_slice["dev"]["by_scorer"]["m"]["pr_auc"]
ci = result.by_slice["dev"]["by_scorer"]["m"]["pr_auc_ci"]
assert 0.0 <= pr_auc <= 1.0
# ci is a BootstrapCI dict with point_estimate + ci_95 [low, high]
assert "ci_95" in ci or ci.get("status") == "skipped"
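Because the output is plain nested dicts, a small helper can flatten one metric into rows for a table or a quick printout. The sketch below runs on a hand-written dict shaped like the tree under “Reading the output”; the numbers are made up, `flatten_metric` is a hypothetical helper, and the exact form of a skipped CI is an assumption.

```python
def flatten_metric(by_slice, metric):
    """Return one row per (slice, scorer) pair for a single metric."""
    rows = []
    for slice_name, slice_data in by_slice.items():
        for scorer_name, metrics in slice_data["by_scorer"].items():
            interval = metrics.get(f"{metric}_ci", {}).get("ci_95")
            # Keep only a real [low, high] pair; a skipped CI becomes None.
            if not (isinstance(interval, (list, tuple)) and len(interval) == 2):
                interval = None
            rows.append({
                "slice": slice_name,
                "scorer": scorer_name,
                "value": metrics[metric],
                "ci_95": interval,
            })
    return rows


# Hand-written example with made-up numbers, not real toolkit output.
example = {
    "dev": {
        "n": 100,
        "by_scorer": {
            "length": {
                "pr_auc": 0.91,
                "pr_auc_ci": {"point_estimate": 0.91, "ci_95": [0.84, 0.96]},
            },
        },
    },
    "tiny": {
        "n": 12,
        "by_scorer": {
            "length": {"pr_auc": 0.80, "pr_auc_ci": {"ci_95": "skipped"}},
        },
    },
}

rows = flatten_metric(example, "pr_auc")
```

Treating a skipped interval as `None` keeps downstream code from mistaking a status string for a numeric range.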
Comparing two scorers#
When you want a paired-difference CI between two scorers on the same
rows, pass paired_diffs=[(baseline, candidate)] to evaluate(...):
import numpy as np
import pandas as pd
from eval_toolkit import EvalSlice, evaluate


class A:
    def predict_proba(self, X):
        return np.array([0.3 + 0.4 * (i % 2) for i in range(len(X))])


class B:
    def predict_proba(self, X):
        return np.array([0.4 + 0.5 * (i % 2) for i in range(len(X))])


df = pd.DataFrame({"text": ["x"] * 40, "label": [0, 1] * 20})
result = evaluate(
    {"a": A(), "b": B()},
    [EvalSlice(name="dev", df=df)],
    run_id="r",
    n_resamples=20,
    paired_diffs=[("a", "b")],  # explicit baseline → candidate pair
)
diffs = result.by_slice["dev"]["paired_diffs"]
assert ("a", "b") in diffs or "a__minus__b" in diffs or len(diffs) >= 1
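If it helps to see what “paired” means mechanically, the sketch below resamples row indices once per replicate and scores both models on the same resampled rows, so row-level noise largely cancels in the difference. It is a plain percentile bootstrap with accuracy at a 0.5 threshold as the metric, chosen for brevity; the function names and toy data are hypothetical, and the toolkit's own method and metrics may differ.

```python
import random


def accuracy(scores, labels, threshold=0.5):
    """Fraction of rows where the thresholded score matches the label."""
    hits = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return hits / len(labels)


def paired_diff_ci(scores_a, scores_b, labels, n_resamples=500, confidence=0.95, seed=0):
    """Percentile-bootstrap CI for accuracy(b) - accuracy(a) on shared rows."""
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    for _ in range(n_resamples):
        # One index draw per replicate, reused for both models: this is the pairing.
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        acc_a = accuracy([scores_a[i] for i in idx], y)
        acc_b = accuracy([scores_b[i] for i in idx], y)
        diffs.append(acc_b - acc_a)
    diffs.sort()
    alpha = (1.0 - confidence) / 2.0
    low = diffs[int(alpha * (n_resamples - 1))]
    high = diffs[int((1.0 - alpha) * (n_resamples - 1))]
    point = accuracy(scores_b, labels) - accuracy(scores_a, labels)
    return point, low, high


# Toy data: model b is right on every row; model a is wrong on every 5th row.
labels = [i % 2 for i in range(100)]
scores_b = [0.9 if y else 0.1 for y in labels]
scores_a = [
    (0.1 if y else 0.9) if i % 5 == 0 else (0.9 if y else 0.1)
    for i, y in enumerate(labels)
]

point, low, high = paired_diff_ci(scores_a, scores_b, labels)
```

An interval that excludes zero is the usual signal that the lift is unlikely to be resampling noise.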
Persist results#
RunResult.to_dict() produces a strict-JSON-safe payload:
import json
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from eval_toolkit import EvalSlice, evaluate
from eval_toolkit.artifacts import write_json_strict


class _S:
    def predict_proba(self, X):
        return np.linspace(0.1, 0.9, len(X))


df = pd.DataFrame({"text": [f"row_{i}" for i in range(10)], "label": [0, 1] * 5})
result = evaluate({"m": _S()}, [EvalSlice(name="dev", df=df)], run_id="demo", n_resamples=10)
out_path = Path(tempfile.gettempdir()) / "demo_results.json"
write_json_strict(result.to_dict(), out_path)
# What the on-disk JSON looks like:
data = json.loads(out_path.read_text())
assert data["run_id"] == "demo"
assert "schema_version" in data
write_json_strict uses allow_nan=False and runs the payload
through sanitize_for_json first — NaN / Inf becomes a structured
skipped_metric(...) payload rather than producing invalid JSON.
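The standard library shows why strict mode matters. In this sketch, `replace_non_finite` and its marker dict are hypothetical stand-ins for illustration only; the real shape of the toolkit's skipped_metric(...) payload is not shown in this guide and may differ.

```python
import json
import math


def replace_non_finite(value):
    """Hypothetical sanitizer: swap NaN / Inf floats for a structured marker."""
    if isinstance(value, float) and not math.isfinite(value):
        return {"status": "skipped", "reason": "non-finite value"}
    if isinstance(value, dict):
        return {key: replace_non_finite(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [replace_non_finite(item) for item in value]
    return value


payload = {"pr_auc": 0.91, "roc_auc": float("nan")}

# By default json.dumps emits the bare token NaN, which is not valid JSON
# and which strict parsers in other languages will reject.
lenient = json.dumps(payload)

# allow_nan=False turns the silent problem into an immediate ValueError.
try:
    json.dumps(payload, allow_nan=False)
    raised = False
except ValueError:
    raised = True

# After sanitizing, strict serialization succeeds and round-trips cleanly.
strict = json.dumps(replace_non_finite(payload), allow_nan=False)
round_tripped = json.loads(strict)
```

Sanitizing first and serializing strictly second means a bad metric shows up as an explicit marker in the file rather than as a parse failure in someone else's pipeline.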
(Optional) Validate the JSON#
Validate against the bundled JSON Schema to catch shape regressions between your harness and consumer parsers:
# Requires: pip install "eval-toolkit[validation]"
import json
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from eval_toolkit import EvalSlice, evaluate
from eval_toolkit.artifacts import validate_payload, write_json_strict


class _S:
    def predict_proba(self, X):
        return np.linspace(0.1, 0.9, len(X))


df = pd.DataFrame({"text": [f"r{i}" for i in range(10)], "label": [0, 1] * 5})
result = evaluate({"m": _S()}, [EvalSlice(name="dev", df=df)], run_id="demo", n_resamples=10)
# This is a no-op on success; raises jsonschema.ValidationError on a bad shape.
validate_payload(result.to_dict(), schema_name="results.v1.json")
You can also validate from the CLI without writing Python:
eval-toolkit validate run_dir/results.json results.v1
(See docs/schemas.md for the field-by-field reference.)
(Optional) Add a claim#
A claim is a release-time go/no-go assertion: “PR-AUC is supported on the dev slice with at least 100 positives and 100 negatives, and the metric value is above 0.7.” Claims are not exploratory metrics — they’re preregistered preconditions that the renderer reads to decide whether to print “we claim X” or “we cannot claim X.”
import numpy as np
import pandas as pd
from eval_toolkit import EvalSlice, evaluate
from eval_toolkit.claims import (
    ClaimSpec,
    evaluate_claims,
    metric_threshold_gate,
    minimum_slice_size_gate,
    required_metric_gate,
)
from eval_toolkit.harness import with_claim_report


class _S:
    def predict_proba(self, X):
        # Score 0.9 when the text contains the positive marker "P", else 0.1.
        return np.array([0.9 if "P" in x else 0.1 for x in X])


df = pd.DataFrame({
    "text": ["P_a", "P_b", "N_a", "N_b"] * 50,
    "label": [1, 1, 0, 0] * 50,
})
result = evaluate({"m": _S()}, [EvalSlice(name="dev", df=df)], run_id="demo", n_resamples=20)
claim = ClaimSpec(
    name="dev_pr_auc_supported",
    gates=(
        required_metric_gate("dev", "m", "pr_auc"),
        minimum_slice_size_gate("dev", min_n=100, min_positive=20, min_negative=20),
        metric_threshold_gate("dev", "m", "pr_auc", op=">=", threshold=0.7),
    ),
)
report = evaluate_claims(result, [claim])
assert report.has_failures() is False
# Attach the claim report to the RunResult for the renderer to read:
result_with_claim = with_claim_report(result, report)
assert result_with_claim.claim_report is not None
Each of the three gate calls above (required_metric_gate,
minimum_slice_size_gate, metric_threshold_gate) is a factory
that returns an EvidenceGate instance — a frozen dataclass bundling
a callable check, a name, and a severity. Custom gates are written by
constructing EvidenceGate directly with your own check function;
the claims_and_gates example walks
through both reference and custom gates end-to-end.
See methodology/claims.md for the full contract — exception handling, severity policy, custom gates.
(Optional) Render a plot#
# Requires: pip install "eval-toolkit[plotting]"
import matplotlib
matplotlib.use("Agg") # non-interactive backend for headless / docs runs
import numpy as np
import tempfile
from pathlib import Path
from eval_toolkit.plotting import plot_metric_bars, save_figure
# Synthetic per-scorer metric summary:
values = {"baseline": 0.65, "candidate_v1": 0.78, "candidate_v2": 0.82}
fig = plot_metric_bars(values, ylabel="PR-AUC", title="Dev slice")
out_path = Path(tempfile.gettempdir()) / "pr_auc_bars.png"
saved = save_figure(fig, out_path)
assert saved.exists()
The plotting module’s API and visual conventions are documented in
each helper’s docstring. See eval_toolkit.plotting.__all__ for the
full list (plot_pr_curve, plot_reliability_diagram,
plot_confusion_matrix_grid, plot_score_histograms, plot_lift_ci,
plot_bootstrap_distribution).
Common errors#
A handful of mistakes come up more often than the rest when you’re starting out:
ValueError: labels must be in {0, 1}#
Your DataFrame has labels other than 0 / 1 — strings, booleans
encoded as integers, or -1 sentinel values. eval-toolkit treats
binary classification as {0, 1} only.
Fix: convert before constructing the slice.
import pandas as pd
raw = pd.DataFrame({"text": ["a", "b"], "label": ["pos", "neg"]})
raw["label"] = (raw["label"] == "pos").astype(int)
# Now raw["label"] is {0, 1}.
assert set(raw["label"]) <= {0, 1}
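One caveat about the one-liner above: it maps every value that is not "pos" to 0, including typos and missing values. A stricter sketch, using a hypothetical `to_binary_labels` helper on plain Python lists, makes both classes explicit and fails on anything unexpected.

```python
def to_binary_labels(raw_labels, positive, negative):
    """Map raw label values to 0 / 1, rejecting anything not listed."""
    positive, negative = set(positive), set(negative)
    converted = []
    for value in raw_labels:
        if value in positive:
            converted.append(1)
        elif value in negative:
            converted.append(0)
        else:
            # Surface typos and missing values instead of folding them into 0.
            raise ValueError(f"unmapped label value: {value!r}")
    return converted


# String labels.
from_strings = to_binary_labels(["pos", "neg", "pos"], positive={"pos"}, negative={"neg"})
# -1 / +1 sentinel encoding.
from_sentinels = to_binary_labels([1, -1, 1], positive={1}, negative={-1})
```

With a DataFrame you would apply the same idea to the label column before building the slice.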
KeyError: missing strata column 'X'#
You passed strata_col="X" to EvalSlice but the DataFrame has no
column named X. Either remove the strata_col= argument or add the
column.
Bootstrap CIs are very wide#
Either n_resamples is too low (default in this guide is 50 for
docs-speed; use 1000+ in real runs), or your slice has very few
positives or negatives. The CI width is a function of both the
resampling budget and the underlying sample size — adding more
resamples won’t help if you only have 5 positives.
Check the slice composition:
import pandas as pd
from eval_toolkit import EvalSlice
df = pd.DataFrame({"text": ["a", "b", "c"], "label": [0, 0, 1]})
slc = EvalSlice(name="dev", df=df)
n_positive = int(slc.df["label"].sum())
n_negative = len(slc.df) - n_positive
assert n_positive >= 1 and n_negative >= 1 # else PR-AUC is undefined
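To see the two effects separately, the sketch below measures the width of a percentile-bootstrap CI for a simple statistic (the positive rate) while varying either the resampling budget or the sample size. It is a toolkit-independent illustration with a hypothetical helper name; the same pattern holds for other metrics, though the exact widths will differ.

```python
import random


def bootstrap_ci_width(values, n_resamples, seed=0, confidence=0.95):
    """Width of a percentile-bootstrap CI for the mean of values."""
    rng = random.Random(seed)
    n = len(values)
    # Each replicate draws n values with replacement and records their mean.
    means = sorted(
        sum(rng.choice(values) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    low = means[int(alpha * (n_resamples - 1))]
    high = means[int((1.0 - alpha) * (n_resamples - 1))]
    return high - low


small = [1, 0] * 10   # 20 rows
large = [1, 0] * 500  # 1000 rows

# Same small sample, two very different resampling budgets.
width_small_few = bootstrap_ci_width(small, n_resamples=50)
width_small_many = bootstrap_ci_width(small, n_resamples=1000)
# Larger sample, modest resampling budget.
width_large = bootstrap_ci_width(large, n_resamples=200)
```

More resamples make the interval estimate more stable, but the small-sample interval stays wide either way; only more rows narrow it substantially.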
RuntimeError: PR curve has no thresholds#
Your predict_proba returned a constant value for every input. PR /
ROC curves are undefined for a single threshold. Fix: check that your
model isn’t outputting the same score for every row.
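A cheap pre-flight check can catch this before you call evaluate(...). The helper below is a hypothetical sketch, not part of the toolkit; it works on any sequence of floats, including a NumPy array.

```python
def has_score_variation(scores, tol=1e-12):
    """Return True when the scores are not all (numerically) identical."""
    if len(scores) == 0:
        return False
    return (max(scores) - min(scores)) > tol


varied = has_score_variation([0.1, 0.4, 0.9])
constant = has_score_variation([0.5, 0.5, 0.5])
```

If the check fails, typical things to look at are an unfitted model, inputs that all collapse to the same features, or a wrapper that returns the wrong column.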
'TYPE_CHECKING' import error for pandas#
The dataframe extra (pip install "eval-toolkit[dataframe]")
installs pandas. pandas is a soft dependency, so the toolkit itself
installs without it, but import pandas will then fail in your harness
code and you will have no DataFrame to pass to EvalSlice. Install the
extra if you’re using DataFrames at all (this guide assumes you are).
Where to go next#
You now have a working RunResult and results.json. Recommended
next reading depending on what you’re doing:
- Building a real eval pipeline. Read three methodology chapters in this order:
  1. leakage.md: making sure your eval data isn’t contaminated by training data.
  2. splits.md: choosing between holdout and K-fold, source-disjoint splitting.
  3. thresholds.md: picking a decision threshold once your scorer ranks well.
- Adding release-time claims. Read methodology/claims.md for the full gate contract and severity policy.
- Replaying old evals. Read methodology/artifacts.md for the PredictionArtifactRef contract that lets you recompute metrics without re-running inference.
- Writing a custom Scorer/Splitter/Gate. Read extending.md.
- Migrating from an older version. Read MIGRATION.md.
- Browsing the JSON Schemas. Read schemas.md for the field-by-field reference, or run eval-toolkit schemas list from the CLI.
The methodology curriculum index covers 16 chapters total — read them in order if you want the full conceptual map.