Testing your evaluation code

Background (skip if you’ve internalized this). Evaluation code is also code, and bugs in it produce subtler errors than bugs in models: a silently wrong metric still looks plausible, because a wrong number looks like any other number. Four test patterns catch the failure modes that matter: (1) property tests catch violations of mathematical invariants (e.g., the ROC-AUC of a score vector and the ROC-AUC of its negation should sum to 1); (2) reference-equivalence tests catch divergence from canonical implementations (sklearn, scipy); (3) golden tests catch regressions in deterministic outputs; (4) visual-regression tests catch plot drift.

eval-toolkit’s own test suite uses all four. This chapter describes the patterns so consumer projects (and downstream agents) can copy them.

The 4 patterns at a glance

Pattern	What it catches	Library	When to use
Property	Math-invariant violations	hypothesis	Every public function with a known invariant
Reference-equivalence	Divergence from canonical impls	pytest + sklearn	Every wrapped function (one per kernel)
Golden	Regressions in deterministic outputs	pytest + JSON snapshot	Anchor-based docs, JSON serializers
Visual regression	Plot rendering drift	pytest-mpl	Every plotting function

eval-toolkit’s tests/ mirrors this taxonomy:

tests/
  test_*_unit.py        ← reference-equivalence + smoke
  test_*_props.py       ← property tests (hypothesis)
  test_docs_golden.py   ← golden tests (snapshot)
  test_plotting_visual.py ← pytest-mpl
  test_schemas.py       ← jsonschema validation (golden-ish)
  strategies.py         ← shared hypothesis strategies

Property tests

A property is a mathematical statement that should hold for every valid input. Hypothesis generates inputs and checks the property; if it finds a counterexample, it shrinks it to a minimal failing case.

The toolkit’s tests/strategies.py exports two reusable strategies:

# tests/strategies.py
# from hypothesis import strategies as st
# def balanced_binary_array(size): ...     # generates (n,) binary arrays with both classes
# def score_array(size): ...                # generates (n,) float arrays in [0, 1]

A property test has the shape:

import numpy as np
import pytest
from eval_toolkit.metrics import roc_auc

# Real test (mirrors tests/test_metrics_props.py:test_auroc_inversion):
@pytest.mark.unit
def test_auroc_inversion_demo() -> None:
    """ROC-AUC is anti-symmetric in score sign: roc_auc(y, -s) = 1 - roc_auc(y, s)."""
    rng = np.random.default_rng(0)
    for _ in range(10):  # mini sweep instead of hypothesis for the doc-block
        y = rng.integers(0, 2, size=80)
        if len(set(y.tolist())) < 2:
            continue
        s = rng.uniform(-1, 1, size=80)
        assert roc_auc(y, s) == pytest.approx(1.0 - roc_auc(y, -s), abs=1e-9)

test_auroc_inversion_demo()  # smoke-call so Sybil verifies it
print("AUROC inversion property holds")

The full property-test version uses Hypothesis decorators, @given(...) plus @settings(deadline=None, max_examples=30), to generate up to 30 cases per test on each run. See tests/test_metrics_props.py for the canonical templates.
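As a rough illustration of what the Hypothesis form can look like, the sketch below restates the inversion property with @given and @settings. It is not the toolkit's test: pairwise_auc is a hypothetical stand-in for roc_auc so the block runs without eval_toolkit, and labels_and_scores is an illustrative strategy, not one of the strategies in tests/strategies.py.

```python
import numpy as np
from hypothesis import given, settings, strategies as st


def pairwise_auc(y, s):
    """Stand-in AUROC (assumption, not the toolkit's roc_auc).

    Fraction of (positive, negative) pairs ranked correctly; ties count 0.5.
    """
    y = np.asarray(y)
    s = np.asarray(s, dtype=float)
    diff = s[y == 1][:, None] - s[y == 0][None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


@st.composite
def labels_and_scores(draw):
    """Illustrative strategy: equal-length labels and scores, both classes present."""
    n = draw(st.integers(min_value=2, max_value=40))
    y = draw(st.lists(st.integers(0, 1), min_size=n, max_size=n))
    y[0], y[1] = 0, 1  # force at least one negative and one positive
    s = draw(st.lists(st.floats(min_value=-1, max_value=1), min_size=n, max_size=n))
    return np.array(y), np.array(s)


@settings(deadline=None, max_examples=30)
@given(labels_and_scores())
def test_auroc_inversion(data):
    y, s = data
    # Negating the scores reverses every pairwise ordering.
    assert abs(pairwise_auc(y, s) + pairwise_auc(y, -s) - 1.0) < 1e-9


test_auroc_inversion()  # calling a @given test directly runs the generated cases
```

If the property fails, Hypothesis reports the shrunken counterexample, which is usually a far smaller array than the one that first triggered the failure.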

Properties worth testing

For every metric:

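Monotone-transform invariance. ROC-AUC depends only on the ranking of the scores, so it is invariant to strictly increasing transformations: roc_auc(y, s) == roc_auc(y, f(s)) for any strictly increasing f. (A decreasing f flips the ranking and gives 1 - roc_auc(y, s); a non-strict f can introduce ties.)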
Boundedness. PR-AUC ∈ [0, 1]; ECE ∈ [0, 1].
Inversion symmetry. roc_auc(y, -s) = 1 - roc_auc(y, s).
Single-class behavior. Returns a "skipped" marker (toolkit convention), not a silent NaN or zero.
Empty / size-1 input. Raises ValueError, not a confusing IndexError.
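The first three properties in this list can be sketched with NumPy alone. The block below is an illustration under assumptions: pairwise_auc is a hypothetical stand-in for the toolkit's roc_auc, and the three transforms are arbitrary examples of strictly increasing functions.

```python
import numpy as np


def pairwise_auc(y, s):
    """Stand-in AUROC (assumption): share of correctly ordered (pos, neg) pairs."""
    y = np.asarray(y)
    s = np.asarray(s, dtype=float)
    diff = s[y == 1][:, None] - s[y == 0][None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=80)
s = rng.uniform(-1, 1, size=80)
assert 0 < y.sum() < len(y)  # both classes must be present

base = pairwise_auc(y, s)

# Monotone-transform invariance: strictly increasing maps keep the ranking.
for f in (np.exp, lambda v: 3.0 * v + 1.0, np.arctan):
    assert pairwise_auc(y, f(s)) == base

# Boundedness.
assert 0.0 <= base <= 1.0

# Inversion symmetry: a decreasing map flips the ranking.
assert abs(pairwise_auc(y, -s) - (1.0 - base)) < 1e-12
```

The invariance check can use exact equality here because the stand-in only counts pair orderings; a metric that interpolates or integrates would need a tolerance.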

For threshold selectors:

MaxF1Selector returns a threshold whose F1 is at least as high as the F1 at any other threshold on the PR curve. This is the optimality property from Lipton-Elkan 2014.
TargetRecallSelector(p) returns a threshold whose recall ≥ p.
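Both selector properties can be checked by brute force over the observed scores. The sketch below does not use the toolkit's classes: max_f1_threshold and target_recall_threshold are hypothetical stand-ins, and the choice of "highest threshold that still reaches the target recall" is an assumption about what such a selector would do.

```python
import numpy as np


def f1_at(y, s, t):
    """F1 when predicting positive for scores >= t."""
    pred = s >= t
    tp = int(np.sum(pred & (y == 1)))
    fp = int(np.sum(pred & (y == 0)))
    fn = int(np.sum(~pred & (y == 1)))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def recall_at(y, s, t):
    """Recall when predicting positive for scores >= t."""
    pos = y == 1
    return float(np.sum((s >= t) & pos) / np.sum(pos))


def max_f1_threshold(y, s):
    """Hypothetical stand-in selector: best F1 over the observed scores."""
    return float(max(np.unique(s), key=lambda t: f1_at(y, s, t)))


def target_recall_threshold(y, s, p):
    """Hypothetical stand-in selector: highest threshold with recall >= p."""
    for t in np.unique(s)[::-1]:  # np.unique sorts ascending, so scan from the top
        if recall_at(y, s, t) >= p:
            return float(t)
    raise ValueError("no threshold reaches the target recall")


rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=100)
s = rng.uniform(0, 1, size=100)
assert 0 < y.sum() < len(y)  # both classes must be present

# Optimality: no candidate threshold beats the selected one.
best = f1_at(y, s, max_f1_threshold(y, s))
assert all(best >= f1_at(y, s, t) for t in np.unique(s))

# Target recall: the selected threshold meets the requested recall.
t_rec = target_recall_threshold(y, s, 0.9)
assert recall_at(y, s, t_rec) >= 0.9
```

Against a real selector the brute-force side stays the same and only the stand-in functions are replaced, which keeps the test independent of the selector's implementation.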

For leakage checks:

run_leakage_checks([], splits) returns an empty LeakageReport, for which has_errors() == False.
A clean fixture produces no error-severity findings with n_affected > 0.

Reference-equivalence tests

When the toolkit wraps a canonical library function, an equivalence test pins the wrapper to the canonical behavior: it runs the wrapper and the canonical implementation on the same input and asserts numerical equality (within tolerance).

import numpy as np
import pytest
from sklearn.metrics import average_precision_score
from eval_toolkit.metrics import pr_auc

# Real test pattern (tests/test_metrics_unit.py:test_pr_auc_matches_sklearn):
rng = np.random.default_rng(42)
y = rng.integers(0, 2, size=200)
s = rng.uniform(0, 1, size=200)

# Compare within a relative tolerance rather than bit-for-bit.
assert pr_auc(y, s) == pytest.approx(average_precision_score(y, s), rel=1e-6)
print("pr_auc matches sklearn.average_precision_score within tolerance")

The toolkit’s policy: every kernel that wraps a canonical implementation gets one equivalence test. From the v0.3 research audit:

Reference-impl tests are sparse — only pr_auc is value-equality-tested against its sklearn equivalent. For a “wraps and validates” library, one such test per wrapped function is the table-stakes contract.

This audit framing is preserved for historical context. The gap has since closed: reference-equivalence tests now ship for pr_auc, roc_auc, brier_score, reliability_curve, bootstrap_ci, fit_isotonic_calibrator, and fit_platt_calibrator (see roadmap.md “Currently shipped”). The wrapping-faithful contract is a steady-state invariant going forward.

Golden tests

A golden test pins a deterministic output to a snapshot file. Running the test re-runs the deterministic computation and compares it to the snapshot. If a code change unintentionally changes the output, the test fails loudly.
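A literal snapshot comparison can be sketched with the standard library alone. Everything here is illustrative: check_golden, stable_view, and the result dictionaries are hypothetical, and the volatile field names are examples of values that change between runs and so are dropped before comparing.

```python
import json
import tempfile
from pathlib import Path

# Fields that legitimately change between runs (illustrative names).
VOLATILE_KEYS = {"git_sha", "wall_clock_seconds"}


def stable_view(result: dict) -> dict:
    """Drop run-dependent fields so only deterministic output is compared."""
    return {k: v for k, v in result.items() if k not in VOLATILE_KEYS}


def check_golden(result: dict, snapshot_path: Path) -> None:
    """Record the snapshot on first run; compare against it afterwards."""
    current = json.dumps(stable_view(result), sort_keys=True, indent=2)
    if not snapshot_path.exists():
        snapshot_path.write_text(current)
        return
    assert snapshot_path.read_text() == current, "output drifted from snapshot"


with tempfile.TemporaryDirectory() as d:
    snap = Path(d) / "result.golden.json"
    check_golden({"pr_auc": 0.75, "git_sha": "abc123"}, snap)  # records the snapshot
    check_golden({"pr_auc": 0.75, "git_sha": "def456"}, snap)  # passes: only a volatile field changed
    try:
        check_golden({"pr_auc": 0.70, "git_sha": "def456"}, snap)
    except AssertionError:
        drift_detected = True  # the metric changed, so the golden test fails
    else:
        drift_detected = False

print("drift detected:", drift_detected)
```

In a real suite the snapshot file would be committed next to the tests rather than written to a temporary directory, so that a changed snapshot shows up in review.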

The toolkit uses golden tests for:

The anchor-based markdown rendering in eval_toolkit.docs.
JSON schema validation of results.json / results_full.json / manifest.json outputs against the v1 schemas in src/eval_toolkit/schemas/ (see tests/test_schemas.py).

Pattern:

# Sketch — see tests/test_schemas.py for the real one.
import json
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from jsonschema import Draft202012Validator

import eval_toolkit
from eval_toolkit import EvalSlice, evaluate, write_run_result

class _Scorer:
    """Constant scorer: predicts 0.5 for every row."""
    def predict_proba(self, X):
        return np.full(len(X), 0.5)

# Run an evaluation (the constant scorer makes the output deterministic).
df = pd.DataFrame({"text": ["a", "b", "c"], "label": [0, 1, 0]})
slice_ = EvalSlice(name="test", df=df)
result = evaluate({"s": _Scorer()}, [slice_], run_id="demo")

with tempfile.TemporaryDirectory() as d:
    compact_path, _ = write_run_result(result, Path(d))
    loaded = json.loads(compact_path.read_text())

# Schema lives at src/eval_toolkit/schemas/results.v1.json.
schema_path = Path(eval_toolkit.__file__).parent / "schemas" / "results.v1.json"
schema = json.loads(schema_path.read_text())
Draft202012Validator(schema).validate(loaded)
print("results.json validates against v1 schema")

Schema validation is more robust than literal JSON snapshots — it allows additive changes (adding new optional fields) without breaking the test, while still catching breaking schema changes.

Visual regression

Plotting code is hard to test with assertions — “the plot looks right” isn’t a numeric predicate. pytest-mpl saves baseline PNGs and pixel-compares on each run.

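# tests/test_plotting_visual.py
import pytest

# plot_pr_curve is the toolkit plotting function under test; its import and arguments are elided here.
@pytest.mark.mpl_image_compare(baseline_dir="baseline")
def test_pr_curve_matches_baseline():
    fig = plot_pr_curve(...)
    return fig  # pytest-mpl compares the returned figure against the baseline PNG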

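Run with pytest --mpl --mpl-baseline-path=tests/baseline. Without --mpl the image comparison is skipped and the test only checks that the plotting code runs. On failure, pytest-mpl writes a diff PNG; the developer inspects it and either accepts the change by regenerating the baseline (pytest --mpl-generate-path=tests/baseline) or fixes the plotting code.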

The pattern is mature enough that the toolkit’s CI runs it on every commit. Consumer projects with their own plots should adopt the same pattern (it’s free coverage for hard-to-assert code).

Putting it all together

A well-tested evaluation module has all four:

my_eval/
  metrics.py
tests/
  test_metrics_unit.py     # reference-equivalence vs sklearn
  test_metrics_props.py    # hypothesis @given invariants
  test_outputs_golden.py   # JSON schema validation on serialized results
  test_plots_visual.py     # pytest-mpl on plotting functions
  baseline/
    test_*.png             # pytest-mpl baselines

This is the structure eval-toolkit’s own tests/ follows. Consumer projects copying this pattern get the same defensive depth: math-invariant correctness, behavioral parity with canonical libraries, serialization stability, and visual stability.

Pitfalls / Common mistakes

Property tests with no actual invariants. “PR-AUC is a number” is not a property. The property tests in tests/test_metrics_props.py are templates of the real invariants (inversion, boundedness, monotone-invariance) — pattern-match those.
Reference tests with no tolerance. sklearn / scipy versions drift; allow pytest.approx(..., rel=1e-6)-level tolerance rather than requiring bit equality, which breaks across numpy / BLAS updates.
Golden snapshots that include non-deterministic fields. A RunResult snapshot that includes git_sha or wall_clock_seconds fails on every commit. Use schema validation (allows additive fields) or strip non-deterministic fields before snapshotting.
Property tests that hit external services. Hypothesis generates 100 inputs per test by default; if your test calls an LLM API per case, the bill is real. Use the smoke-test pattern (10 hand-picked inputs) for expensive predictors.
Visual regression baselines committed without review. A pytest-mpl failure can be silenced by re-generating the baseline. Treat baseline regeneration as a code change subject to PR review, not a “fix the test” reflex.
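The tolerance pitfall comes down to floating-point addition not being associative: a library update that reorders an accumulation can change the last bits of a result without changing its meaning. A minimal standard-library illustration (the numbers are arbitrary, not toolkit output):

```python
import math

# The same three numbers, summed in two different orders.
left_first = (0.1 + 0.2) + 0.3
right_first = 0.1 + (0.2 + 0.3)

# Bit equality fails even though both are "the sum of 0.1, 0.2, 0.3".
assert left_first != right_first

# A relative tolerance treats them as the same value.
assert math.isclose(left_first, right_first, rel_tol=1e-6)

print(left_first, right_first)
```

A reference-equivalence test written with exact equality would fail on this kind of difference, even though neither implementation is wrong.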