Testing your evaluation code
Background (skip if you’ve internalized this). Evaluation code is also code, and bugs in it produce subtler errors than bugs in models: a silently wrong metric still looks plausible, because “metric numbers are just numbers”. Four test patterns catch the failure modes that matter: (1) property tests catch mathematical invariant violations (e.g., the ROC-AUC of a score vector and the ROC-AUC of its negation should sum to 1); (2) reference-equivalence tests catch divergence from canonical implementations (sklearn, scipy); (3) golden tests catch regressions in deterministic outputs; (4) visual-regression tests catch plot drift.
eval-toolkit’s own test suite uses all four. This chapter describes the patterns so consumer projects (and downstream agents) can copy them.
The 4 patterns at a glance
| Pattern | What it catches | Library | When to use |
|---|---|---|---|
| Property | Math-invariant violations | hypothesis | Every public function with a known invariant |
| Reference-equivalence | Divergence from canonical impls | pytest + sklearn | Every wrapped function (one per kernel) |
| Golden | Regressions in deterministic outputs | pytest + JSON snapshot | Anchor-based docs, JSON serializers |
| Visual regression | Plot rendering drift | pytest-mpl | Every plotting function |
eval-toolkit’s tests/ mirrors this taxonomy:
tests/
    test_*_unit.py            ← reference-equivalence + smoke
    test_*_props.py           ← property tests (hypothesis)
    test_docs_golden.py       ← golden tests (snapshot)
    test_plotting_visual.py   ← pytest-mpl
    test_schemas.py           ← jsonschema validation (golden-ish)
    strategies.py             ← shared hypothesis strategies
Property tests
A property is a mathematical statement that should hold for every valid input. Hypothesis generates inputs and checks the property; if it finds a counterexample, it shrinks it to a minimal failing case.
The toolkit’s tests/strategies.py exports two reusable strategies:
# tests/strategies.py
# from hypothesis import strategies as st
# def balanced_binary_array(size): ... # generates (n,) binary arrays with both classes
# def score_array(size): ... # generates (n,) float arrays in [0, 1]
A property test has the shape:
import numpy as np
import pytest
from eval_toolkit import roc_auc

# Real test (mirrors tests/test_metrics_props.py:test_auroc_inversion):
@pytest.mark.unit
def test_auroc_inversion_demo() -> None:
    """ROC-AUC is anti-symmetric in score sign: roc_auc(y, -s) = 1 - roc_auc(y, s)."""
    rng = np.random.default_rng(0)
    for _ in range(10):  # mini sweep instead of hypothesis for the doc-block
        y = rng.integers(0, 2, size=80)
        if len(set(y.tolist())) < 2:
            continue  # single-class draw: ROC-AUC is undefined, skip it
        s = rng.uniform(-1, 1, size=80)
        assert roc_auc(y, s) == pytest.approx(1.0 - roc_auc(y, -s), abs=1e-9)

test_auroc_inversion_demo()  # smoke-call so Sybil verifies it
print("AUROC inversion property holds")
The full property-test version uses Hypothesis decorators,
@given(...) + @settings(deadline=None, max_examples=30), so that each
test checks up to 30 generated cases and a full run covers hundreds. See
tests/test_metrics_props.py for the canonical templates.
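As a rough illustration of that shape, the sketch below states the inversion property with @given and @settings. It is not the toolkit's test: pairwise_auc is a hypothetical pure-Python stand-in for roc_auc, and the inline strategy is an assumption rather than one of the shared strategies in tests/strategies.py.

```python
import math

from hypothesis import given, settings, strategies as st


def pairwise_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Non-empty lists of finite scores; drawing one list per class guarantees both classes exist.
finite_scores = st.lists(
    st.floats(min_value=-1.0, max_value=1.0, allow_nan=False),
    min_size=1,
    max_size=20,
)


@settings(deadline=None, max_examples=30)
@given(pos=finite_scores, neg=finite_scores)
def test_auc_inversion(pos, neg):
    labels = [1] * len(pos) + [0] * len(neg)
    scores = pos + neg
    negated = [-s for s in scores]
    assert math.isclose(
        pairwise_auc(labels, scores),
        1.0 - pairwise_auc(labels, negated),
        abs_tol=1e-9,
    )


test_auc_inversion()  # calling a @given-decorated test runs the generated examples
```

Generating the positive and negative scores separately avoids filtering out single-class draws, which keeps every generated example useful.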
Properties worth testing
For every metric:
- Monotone-transform invariance. ROC-AUC is invariant to strictly increasing transformations of the score: roc_auc(y, s) == roc_auc(y, f(s)) for any strictly increasing f. (A decreasing f flips the ranking; see inversion symmetry.)
- Boundedness. PR-AUC ∈ [0, 1]; ECE ∈ [0, 1].
- Inversion symmetry. roc_auc(y, -s) = 1 - roc_auc(y, s).
- Single-class behavior. Returns a "skipped" marker (toolkit convention), not a silent NaN or zero.
- Empty / size-1 input. Raises ValueError, not a confusing IndexError.
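The metric properties above can be sketched with the standard library alone. The function safe_auc below is hypothetical: it stands in for a toolkit metric and only imitates the edge-case conventions described in the list (a "skipped" marker for single-class input, ValueError for undersized input).

```python
import math
import random


def safe_auc(labels, scores):
    """Hypothetical metric: pairwise ROC-AUC with explicit edge-case handling."""
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    if len(labels) < 2:
        raise ValueError("need at least two samples")
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return "skipped"  # single-class input: explicit marker, not NaN or 0
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


rng = random.Random(0)
for _ in range(50):
    labels = [0, 1] + [rng.randint(0, 1) for _ in range(28)]  # both classes present
    scores = [rng.uniform(-1.0, 1.0) for _ in labels]
    auc = safe_auc(labels, scores)

    # Boundedness.
    assert 0.0 <= auc <= 1.0
    # Invariance under a strictly increasing transform (here exp(3s) + 1).
    assert safe_auc(labels, [math.exp(3.0 * s) + 1.0 for s in scores]) == auc

# Single-class behavior.
assert safe_auc([1, 1, 1], [0.2, 0.5, 0.9]) == "skipped"

# Empty / size-1 input.
for bad_labels, bad_scores in [([], []), ([1], [0.5])]:
    try:
        safe_auc(bad_labels, bad_scores)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for undersized input")
```

The invariance check can use exact equality here because the transform preserves the ordering of the scores, so the pairwise counts are identical.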
For threshold selectors:
- MaxF1Selector returns a threshold whose F1 is ≥ the F1 at any other threshold on the PR curve. This is the optimality property from Lipton-Elkan 2014.
- TargetRecallSelector(p) returns a threshold whose recall ≥ p.
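A minimal sketch of how these two selector properties might be checked follows. The helpers max_f1_threshold and target_recall_threshold are hypothetical stand-ins, not the toolkit's MaxF1Selector or TargetRecallSelector, and they assume the candidate thresholds are the observed scores.

```python
import random


def f1_and_recall(labels, scores, threshold):
    """F1 and recall when predicting positive for score >= threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    recall = tp / (tp + fn)  # assumes at least one positive label
    return f1, recall


def max_f1_threshold(labels, scores):
    """Hypothetical selector: the observed score that maximizes F1."""
    return max(set(scores), key=lambda t: f1_and_recall(labels, scores, t)[0])


def target_recall_threshold(labels, scores, target):
    """Hypothetical selector: the highest observed score whose recall is at least `target`."""
    ok = [t for t in set(scores) if f1_and_recall(labels, scores, t)[1] >= target]
    return max(ok)


rng = random.Random(1)
for _ in range(30):
    labels = [0, 1] + [rng.randint(0, 1) for _ in range(38)]  # both classes present
    scores = [rng.random() for _ in labels]

    # Optimality: no other candidate threshold beats the selected one.
    best = max_f1_threshold(labels, scores)
    best_f1, _ = f1_and_recall(labels, scores, best)
    assert all(f1_and_recall(labels, scores, t)[0] <= best_f1 for t in scores)

    # Recall guarantee: the chosen threshold meets the requested recall.
    for target in (0.5, 0.8, 0.95):
        chosen = target_recall_threshold(labels, scores, target)
        assert f1_and_recall(labels, scores, chosen)[1] >= target
```

Against these toy selectors the assertions hold by construction; the same assertions become meaningful when pointed at a real selector whose search is implemented differently (for example, on a precomputed PR curve).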
For leakage checks:
- run_leakage_checks([], splits) returns an empty LeakageReport, with has_errors() == False.
- A clean fixture produces no error-severity findings with n_affected > 0.
Reference-equivalence tests
When the toolkit wraps a canonical library function, an equivalence test pins the wrapper to the canonical behavior: it runs the wrapper and the canonical implementation on the same input and asserts numerical equality (within tolerance).
import numpy as np
from sklearn.metrics import average_precision_score
from eval_toolkit import pr_auc
# Real test pattern (tests/test_metrics_unit.py:test_pr_auc_matches_sklearn):
rng = np.random.default_rng(42)
y = rng.integers(0, 2, size=200)
s = rng.uniform(0, 1, size=200)
# Compare within a small relative tolerance rather than bit-for-bit.
assert np.isclose(pr_auc(y, s), average_precision_score(y, s), rtol=1e-6)
print("pr_auc matches sklearn.average_precision_score within tolerance")
The toolkit’s policy: every kernel that wraps a canonical implementation gets one equivalence test. From the v0.3 research audit:
> Reference-impl tests are sparse: only pr_auc is value-equality-tested against its sklearn equivalent. For a “wraps and validates” library, one such test per wrapped function is the table-stakes contract.
This is a known v0.7 gap closing in PR 1.5 alongside the property tests for the new modules.
Golden tests
A golden test pins a deterministic output to a snapshot file. Running the test re-runs the deterministic computation and compares it to the snapshot. If a code change unintentionally changes the output, the test fails loudly.
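The mechanics can be sketched with the standard library. Everything named here (summarize, check_golden, the snapshot filename) is hypothetical and chosen for illustration; it is not the toolkit's golden-test helper.

```python
import json
import tempfile
from pathlib import Path


def summarize(values):
    """Deterministic computation under test (a stand-in for a real serializer)."""
    return {"n": len(values), "mean": round(sum(values) / len(values), 6), "max": max(values)}


def check_golden(actual, snapshot_path, update=False):
    """Compare `actual` to the stored snapshot; write it when asked to, or on first run."""
    if update or not snapshot_path.exists():
        snapshot_path.write_text(json.dumps(actual, indent=2, sort_keys=True))
        return
    expected = json.loads(snapshot_path.read_text())
    assert actual == expected, f"output drifted from {snapshot_path.name}"


with tempfile.TemporaryDirectory() as d:
    snapshot = Path(d) / "summary.golden.json"
    check_golden(summarize([1, 2, 3, 4]), snapshot)  # first run records the snapshot
    check_golden(summarize([1, 2, 3, 4]), snapshot)  # unchanged output passes
    try:
        check_golden(summarize([1, 2, 3, 5]), snapshot)  # changed output fails loudly
    except AssertionError as exc:
        drift_message = str(exc)
```

Writing the snapshot automatically on first run is convenient locally, but in CI it can hide a missing snapshot file; a stricter variant would fail when the file is absent unless update is set.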
The toolkit uses golden tests for:
- The anchor-based markdown rendering in eval_toolkit.docs.
- JSON schema validation of results.json / results_full.json / manifest.json outputs against the v1 schemas in src/eval_toolkit/schemas/ (see tests/test_schemas.py).
Pattern:
# Sketch; see tests/test_schemas.py for the real one.
import json
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from jsonschema import Draft202012Validator

import eval_toolkit
from eval_toolkit import EvalSlice, evaluate, write_run_result


class _Scorer:
    """Constant scorer: returns 0.5 for every row."""

    def predict_proba(self, X):
        return np.full(len(X), 0.5)


# Run an evaluation (deterministic output for fixed seed).
df = pd.DataFrame({"text": ["a", "b", "c"], "label": [0, 1, 0]})
slice_ = EvalSlice(name="test", df=df)
result = evaluate({"s": _Scorer()}, [slice_], run_id="demo")

# Serialize to a temporary directory and read the compact result back.
with tempfile.TemporaryDirectory() as d:
    compact_path, _ = write_run_result(result, Path(d))
    loaded = json.loads(compact_path.read_text())

# Schema lives at src/eval_toolkit/schemas/results.v1.json.
schema_path = Path(eval_toolkit.__file__).parent / "schemas" / "results.v1.json"
schema = json.loads(schema_path.read_text())
Draft202012Validator(schema).validate(loaded)
print("results.json validates against v1 schema")
Schema validation is more robust than literal JSON snapshots — it allows additive changes (adding new optional fields) without breaking the test, while still catching breaking schema changes.
Visual regression
Plotting code is hard to test with assertions — “the plot looks right”
isn’t a numeric predicate. pytest-mpl saves baseline PNGs and
pixel-compares on each run.
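# tests/test_plotting_visual.py
import pytest

# plot_pr_curve is the plotting function under test (import omitted here).
@pytest.mark.mpl_image_compare(baseline_dir="baseline")
def test_pr_curve_matches_baseline():
    fig = plot_pr_curve(...)
    return fig  # pytest-mpl compares the returned figure to the baseline PNG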
Run with pytest --mpl --mpl-baseline-path=tests/baseline. A failure
writes a diff PNG; the developer inspects it and either accepts the
change (regenerating the baseline) or fixes the plot code.
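To make "pixel-compare" concrete, here is a rough sketch of the underlying idea: reduce two images to a single root-mean-square difference and compare it to a tolerance. This is an illustration only, not pytest-mpl's implementation; the flat grayscale lists and the TOLERANCE value are assumptions.

```python
import math


def rms_difference(image_a, image_b):
    """Root-mean-square difference between two equally sized grayscale images (flat 0-255 lists)."""
    if len(image_a) != len(image_b) or not image_a:
        raise ValueError("images must be non-empty and have the same number of pixels")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(image_a, image_b)) / len(image_a))


baseline = [255] * 90 + [0] * 10  # mostly white canvas with a short dark line
same = list(baseline)
shifted = [255] * 80 + [0] * 10 + [255] * 10  # the line moved by ten pixels

TOLERANCE = 2.0  # illustrative threshold
assert rms_difference(baseline, same) <= TOLERANCE
assert rms_difference(baseline, shifted) > TOLERANCE
```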
The pattern is mature enough that the toolkit’s CI runs it on every commit. Consumer projects with their own plots should adopt the same pattern (it’s free coverage for hard-to-assert code).
Putting it all together
A well-tested evaluation module has all four:
my_eval/
    metrics.py
    tests/
        test_metrics_unit.py      # reference-equivalence vs sklearn
        test_metrics_props.py     # hypothesis @given invariants
        test_outputs_golden.py    # JSON schema validation on serialized results
        test_plots_visual.py      # pytest-mpl on plotting functions
        baseline/
            test_*.png            # pytest-mpl baselines
This is the structure eval-toolkit’s own tests/ follows. Consumer
projects copying this pattern get the same defensive depth: math-
invariant correctness, behavioral parity with canonical libraries,
serialization stability, and visual stability.
Pitfalls / Common mistakes
- Property tests with no actual invariants. “PR-AUC is a number” is not a property. The property tests in tests/test_metrics_props.py are templates of the real invariants (inversion, boundedness, monotone-invariance); pattern-match those.
- Reference tests with no tolerance. sklearn / scipy versions drift; allow pytest.approx(..., rel=1e-6)-level tolerance, not bit equality. Bit equality breaks across numpy / BLAS updates.
- Golden snapshots that include non-deterministic fields. A RunResult snapshot that includes git_sha or wall_clock_seconds fails on every commit. Use schema validation (allows additive fields) or strip non-deterministic fields before snapshotting.
- Property tests that hit external services. Hypothesis generates hundreds of inputs per run; if your test calls an LLM API per case, the bill is real. Use the smoke-test pattern (10 hand-picked inputs) for expensive predictors.
- Visual regression baselines committed without review. A pytest-mpl failure can be silenced by re-generating the baseline. Treat baseline regeneration as a code change subject to PR review, not a “fix the test” reflex.
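For the golden-snapshot pitfall, stripping non-deterministic fields before comparison might look like the sketch below. The helper strip_volatile and the payload layout are hypothetical; the two field names come from the pitfall's own examples, and the right set is project-specific.

```python
import copy

# Field names follow the examples in the text; treat the set as project-specific.
VOLATILE_FIELDS = {"git_sha", "wall_clock_seconds"}


def strip_volatile(payload):
    """Return a copy of `payload` with volatile keys removed at every nesting level."""
    if isinstance(payload, dict):
        return {k: strip_volatile(v) for k, v in payload.items() if k not in VOLATILE_FIELDS}
    if isinstance(payload, list):
        return [strip_volatile(item) for item in payload]
    return copy.deepcopy(payload)


run_a = {
    "run_id": "demo",
    "git_sha": "abc123",
    "slices": [{"name": "test", "wall_clock_seconds": 1.2, "pr_auc": 0.5}],
}
run_b = {
    "run_id": "demo",
    "git_sha": "def456",
    "slices": [{"name": "test", "wall_clock_seconds": 3.4, "pr_auc": 0.5}],
}

assert run_a != run_b  # raw payloads differ between runs
assert strip_volatile(run_a) == strip_volatile(run_b)  # stable once volatile fields are gone
```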
Further reading
- Hypothesis docs: https://hypothesis.readthedocs.io/
- pytest-mpl docs: https://pytest-mpl.readthedocs.io/
- jsonschema docs: https://python-jsonschema.readthedocs.io/
- Property-Based Testing for Stats Code, PyData 2018 (Christopher Armstrong): taxonomy of stats-code-relevant properties.
- The v0.3 research audit (docs/v0.3_research_audit.md): catalogs the toolkit’s own kernel-level reference-equivalence gaps.
See also: reproducibility.md (golden tests need deterministic outputs).