# Testing your evaluation code

> **Background** *(skip if you've internalized this)*. Evaluation code is
> *also* code, and bugs in it produce subtler errors than bugs in
> models — a metric that's silently wrong looks plausible because
> "metric numbers are just numbers". Four test patterns catch the
> failure modes that matter: (1) property tests catch mathematical
> invariant violations (e.g., ROC-AUC + ROC-AUC of inverted scores
> should sum to 1); (2) reference-equivalence tests catch divergence
> from canonical implementations (sklearn, scipy); (3) golden tests
> catch regressions in deterministic outputs; (4) visual-regression
> tests catch plot drift.

eval-toolkit's own test suite uses all four. This chapter describes the
patterns so consumer projects (and downstream agents) can copy them.

## The 4 patterns at a glance

| Pattern | What it catches | Library | When to use |
|---|---|---|---|
| Property | Math-invariant violations | hypothesis | Every public function with a known invariant |
| Reference-equivalence | Divergence from canonical impls | pytest + sklearn | Every wrapped function (one per kernel) |
| Golden | Regressions in deterministic outputs | pytest + JSON snapshot | Anchor-based docs, JSON serializers |
| Visual regression | Plot rendering drift | pytest-mpl | Every plotting function |

eval-toolkit's `tests/` mirrors this taxonomy:

```
tests/
  test_*_unit.py        ← reference-equivalence + smoke
  test_*_props.py       ← property tests (hypothesis)
  test_docs_golden.py   ← golden tests (snapshot)
  test_plotting_visual.py ← pytest-mpl
  test_schemas.py       ← jsonschema validation (golden-ish)
  strategies.py         ← shared hypothesis strategies
```

(property-tests)=
## Property tests
A *property* is a mathematical statement that should hold for every
valid input. Hypothesis generates inputs and checks the property; if
it finds a counterexample, it shrinks it to a minimal failing case.

The toolkit's `tests/strategies.py` exports two reusable strategies:

```python
# tests/strategies.py
# from hypothesis import strategies as st
# def balanced_binary_array(size): ...     # generates (n,) binary arrays with both classes
# def score_array(size): ...                # generates (n,) float arrays in [0, 1]
```

A property test has the shape:

```python
import numpy as np
import pytest
from eval_toolkit import roc_auc

# Real test (mirrors tests/test_metrics_props.py:test_auroc_inversion):
@pytest.mark.unit
def test_auroc_inversion_demo() -> None:
    """ROC-AUC is anti-symmetric in score sign: roc_auc(y, -s) = 1 - roc_auc(y, s)."""
    rng = np.random.default_rng(0)
    for _ in range(10):  # mini sweep instead of hypothesis for the doc-block
        y = rng.integers(0, 2, size=80)
        if len(set(y.tolist())) < 2:
            continue
        s = rng.uniform(-1, 1, size=80)
        assert roc_auc(y, s) == pytest.approx(1.0 - roc_auc(y, -s), abs=1e-9)

test_auroc_inversion_demo()  # smoke-call so Sybil verifies it
print("AUROC inversion property holds")
```

The full property-test version uses Hypothesis decorators,
`@given(...)` plus `@settings(deadline=None, max_examples=30)`, to
generate up to `max_examples` cases per test (30 here) on each run. See
`tests/test_metrics_props.py` for the canonical templates.
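
As a rough sketch of what the Hypothesis form of the inversion test might look like: the block below swaps in a small pure-Python rank-based AUC (`rank_auc`, a hypothetical stand-in for `roc_auc`) so it runs without the toolkit, and builds its inputs inline rather than through `tests/strategies.py`. The strategy shape and the tolerance are assumptions, not the toolkit's actual test.

```python
import math

from hypothesis import given, settings
from hypothesis import strategies as st


def rank_auc(y, s):
    """Pairwise (Mann-Whitney) AUC; hypothetical stand-in for roc_auc."""
    pos = [si for yi, si in zip(y, s) if yi == 1]
    neg = [si for yi, si in zip(y, s) if yi == 0]
    # A positive outscoring a negative counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Assumed input shape: finite scores in [-1, 1].
scores = st.floats(min_value=-1.0, max_value=1.0, allow_nan=False)


@settings(deadline=None, max_examples=30)
@given(
    pos=st.lists(scores, min_size=1, max_size=20),
    neg=st.lists(scores, min_size=1, max_size=20),
)
def test_auroc_inversion(pos, neg):
    # Both classes are present by construction, so no draws are discarded.
    y = [1] * len(pos) + [0] * len(neg)
    s = pos + neg
    inverted = [-score for score in s]
    assert math.isclose(rank_auc(y, inverted), 1.0 - rank_auc(y, s), abs_tol=1e-9)
```

Calling `test_auroc_inversion()` directly runs the generated examples; under pytest the function is collected like any other test.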

### Properties worth testing

For every metric:

- **Monotone-transform invariance.** ROC-AUC is invariant to strictly
  increasing transformations of the score:
  `roc_auc(y, s) == roc_auc(y, f(s))` for any strictly increasing `f`.
  (A decreasing `f` reverses the ranking; see inversion symmetry below.)
- **Boundedness.** PR-AUC ∈ [0, 1]; ECE ∈ [0, 1].
- **Inversion symmetry.** `roc_auc(y, -s) = 1 - roc_auc(y, s)`.
- **Single-class behavior.** Returns a `"skipped"` marker (toolkit
  convention), not a silent NaN or zero.
- **Empty / size-1 input.** Raises `ValueError`, not a confusing
  `IndexError`.
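
The first two invariants in this list can be sketched as a seeded sweep using only the standard library. `pairwise_auc` is a hypothetical stand-in for `roc_auc`, and the transforms are just examples of strictly increasing functions.

```python
import math
import random


def pairwise_auc(y, s):
    """Hypothetical stand-in for roc_auc: P(pos score > neg score), ties count 0.5."""
    pos = [si for yi, si in zip(y, s) if yi == 1]
    neg = [si for yi, si in zip(y, s) if yi == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Strictly increasing transforms preserve every pairwise ordering of scores.
INCREASING = [math.exp, lambda v: 3.0 * v + 1.0, lambda v: v ** 3]


def check_metric_invariants(seed=0, trials=20, n=50):
    rng = random.Random(seed)
    for _ in range(trials):
        # Force both classes to be present, then fill the rest at random.
        y = [0, 1] + [rng.randint(0, 1) for _ in range(n - 2)]
        s = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        base = pairwise_auc(y, s)
        assert 0.0 <= base <= 1.0  # boundedness
        for f in INCREASING:
            transformed = [f(v) for v in s]
            # Monotone-transform invariance: the ranking, hence the AUC, is unchanged.
            assert math.isclose(pairwise_auc(y, transformed), base, abs_tol=1e-12)
    return True
```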

For threshold selectors:

- **`MaxF1Selector` returns F1 ≥ F1 at any other threshold on the
  PR curve.** This is the optimality property from Lipton-Elkan 2014.
- **`TargetRecallSelector(p)` returns a threshold whose recall ≥ p.**
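
Both selector properties can be phrased against a brute-force oracle. The sketch below uses hypothetical stand-ins (`max_f1_threshold`, `target_recall_threshold`) because the real selectors' call signatures are not shown here; it assumes predictions are `score >= threshold` and that candidate thresholds are the observed scores.

```python
import random


def confusion(y, s, t):
    """TP, FP, FN when predicting positive for score >= t."""
    tp = sum(1 for yi, si in zip(y, s) if yi == 1 and si >= t)
    fp = sum(1 for yi, si in zip(y, s) if yi == 0 and si >= t)
    fn = sum(1 for yi, si in zip(y, s) if yi == 1 and si < t)
    return tp, fp, fn


def f1_at(y, s, t):
    tp, fp, fn = confusion(y, s, t)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def recall_at(y, s, t):
    tp, _, fn = confusion(y, s, t)
    return tp / (tp + fn)


def max_f1_threshold(y, s):
    """Hypothetical selector: the observed score with the highest F1."""
    return max(sorted(set(s)), key=lambda t: f1_at(y, s, t))


def target_recall_threshold(y, s, p):
    """Hypothetical selector: the highest observed score whose recall is >= p."""
    return max(t for t in set(s) if recall_at(y, s, t) >= p)


def check_selector_properties(seed=0, trials=20, n=40):
    rng = random.Random(seed)
    for _ in range(trials):
        # Force both classes to be present so recall is always defined.
        y = [0, 1] + [rng.randint(0, 1) for _ in range(n - 2)]
        s = [rng.random() for _ in range(n)]
        best = max_f1_threshold(y, s)
        # Optimality: no other candidate threshold beats the selected one.
        assert all(f1_at(y, s, best) >= f1_at(y, s, t) for t in s)
        for p in (0.5, 0.8, 1.0):
            t = target_recall_threshold(y, s, p)
            assert recall_at(y, s, t) >= p  # the recall floor is met
    return True
```

With these stand-ins the optimality assertion holds by construction; pointed at a real selector, the same assertion becomes a genuine check because the brute-force comparison is independent of the implementation.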

For leakage checks:

- **`run_leakage_checks([], splits)` returns an empty `LeakageReport`
  whose `has_errors()` is `False`.**
- **A clean fixture produces no error-severity findings with
  n_affected > 0.**

(reference-equivalence)=
## Reference-equivalence tests
When the toolkit *wraps* a canonical library function, an equivalence
test pins the wrapper to the original: it runs the wrapper and the
canonical implementation on the same input and asserts numerical
equality (within tolerance).

```python
import numpy as np
import pytest
from sklearn.metrics import average_precision_score
from eval_toolkit import pr_auc

# Real test pattern (tests/test_metrics_unit.py:test_pr_auc_matches_sklearn):
rng = np.random.default_rng(42)
y = rng.integers(0, 2, size=200)
s = rng.uniform(0, 1, size=200)

# Compare within a tight relative tolerance rather than bit-for-bit.
assert pr_auc(y, s) == pytest.approx(float(average_precision_score(y, s)), rel=1e-6)
print("pr_auc matches sklearn.average_precision_score within tolerance")
```

The toolkit's policy: every kernel that wraps a canonical implementation
gets one equivalence test. From the v0.3 research audit:

> *Reference-impl tests are sparse — only `pr_auc` is value-equality-tested
> against its sklearn equivalent. For a "wraps and validates" library,
> one such test per wrapped function is the table-stakes contract.*

This is a known v0.7 gap closing in PR 1.5 alongside the property
tests for the new modules.

(golden-tests)=
## Golden tests
A golden test pins a deterministic output to a snapshot file. Running
the test re-runs the deterministic computation and compares it to the
snapshot. If a code change unintentionally changes the output, the test
fails loudly.

The toolkit uses golden tests for:

1. The anchor-based markdown rendering in
   [`eval_toolkit.docs`](../api/docs.md).
2. **JSON schema validation** of `results.json` /
   `results_full.json` / `manifest.json` outputs against the v1
   schemas in `src/eval_toolkit/schemas/` (see `tests/test_schemas.py`).

Pattern:

```python
# Sketch — see tests/test_schemas.py for the real one.
import json
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from jsonschema import Draft202012Validator

import eval_toolkit
from eval_toolkit import EvalSlice, evaluate, write_run_result

class _Scorer:
    """Constant scorer that always predicts 0.5."""

    def predict_proba(self, X):
        return np.full(len(X), 0.5)

# Run an evaluation; the constant scorer keeps the output deterministic.
df = pd.DataFrame({"text": ["a", "b", "c"], "label": [0, 1, 0]})
slice_ = EvalSlice(name="test", df=df)
result = evaluate({"s": _Scorer()}, [slice_], run_id="demo")

with tempfile.TemporaryDirectory() as d:
    compact_path, _ = write_run_result(result, Path(d))
    loaded = json.loads(compact_path.read_text())

# Schema lives at src/eval_toolkit/schemas/results.v1.json.
schema_path = Path(eval_toolkit.__file__).parent / "schemas" / "results.v1.json"
schema = json.loads(schema_path.read_text())
Draft202012Validator(schema).validate(loaded)
print("results.json validates against v1 schema")
```

Schema validation is more robust than literal JSON snapshots — it
allows additive changes (adding new optional fields) without breaking
the test, while still catching breaking schema changes.
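
For comparison, a literal-snapshot golden test can be sketched with the standard library alone. The helper names (`strip_volatile`, `assert_matches_golden`), the `VOLATILE_KEYS` set, the sample payloads, and the write-on-first-run behavior are illustrative assumptions, not the toolkit's API.

```python
import json
import tempfile
from pathlib import Path

# Assumed set of fields that legitimately differ between runs.
VOLATILE_KEYS = {"git_sha", "wall_clock_seconds"}


def strip_volatile(payload):
    """Recursively drop run-specific keys so the snapshot stays deterministic."""
    if isinstance(payload, dict):
        return {k: strip_volatile(v) for k, v in payload.items() if k not in VOLATILE_KEYS}
    if isinstance(payload, list):
        return [strip_volatile(v) for v in payload]
    return payload


def assert_matches_golden(payload, golden_path):
    """Compare against the snapshot; write it on the first run."""
    rendered = json.dumps(strip_volatile(payload), indent=2, sort_keys=True)
    if not golden_path.exists():
        golden_path.write_text(rendered)  # new baseline: review it like code
        return
    assert rendered == golden_path.read_text(), f"output drifted from {golden_path}"


with tempfile.TemporaryDirectory() as d:
    golden = Path(d) / "results.golden.json"
    run_a = {"metrics": {"pr_auc": 0.75}, "git_sha": "abc123"}
    run_b = {"metrics": {"pr_auc": 0.75}, "git_sha": "def456"}
    assert_matches_golden(run_a, golden)  # first run writes the snapshot
    assert_matches_golden(run_b, golden)  # same metrics, different sha: passes
```

Unlike schema validation, this form fails on any change to the retained fields, including additive ones, so it suits outputs whose exact content is part of the contract.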

(visual-regression)=
## Visual regression
Plotting code is hard to test with assertions — "the plot looks right"
isn't a numeric predicate. `pytest-mpl` saves baseline PNGs and
pixel-compares on each run.

```
# tests/test_plotting_visual.py
@pytest.mark.mpl_image_compare(baseline_dir="baseline")
def test_pr_curve_matches_baseline():
    fig = plot_pr_curve(...)
    return fig
```

Run with `pytest --mpl --mpl-baseline-path=tests/baseline`. A failure
writes a diff PNG; the developer inspects it and either accepts the
change (regenerating the baseline with
`pytest --mpl-generate-path=tests/baseline`) or fixes the plot code.

The pattern is mature enough that the toolkit's CI runs it on every
commit. Consumer projects with their own plots should adopt the same
pattern (it's free coverage for hard-to-assert code).
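
A fuller, self-contained version of such a test might look like the following sketch. `plot_pr_points` is a hypothetical plotting function standing in for the toolkit's own, and the `tolerance` value is an arbitrary example; `remove_text=True` asks pytest-mpl to strip titles and tick labels before comparing, which reduces failures caused by font differences between machines.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the test also runs in CI

import matplotlib.pyplot as plt
import pytest


def plot_pr_points(recall, precision):
    """Hypothetical plotting function; stands in for the toolkit's own."""
    fig, ax = plt.subplots(figsize=(4, 3))
    ax.step(recall, precision, where="post")
    ax.set_xlabel("Recall")
    ax.set_ylabel("Precision")
    ax.set_xlim(0.0, 1.0)
    ax.set_ylim(0.0, 1.05)
    return fig


@pytest.mark.mpl_image_compare(baseline_dir="baseline", tolerance=2, remove_text=True)
def test_pr_points_matches_baseline():
    # Fixed inputs: a visual baseline is only meaningful for deterministic data.
    recall = [0.0, 0.25, 0.5, 0.75, 1.0]
    precision = [1.0, 1.0, 0.8, 0.7, 0.5]
    return plot_pr_points(recall, precision)
```

When pytest is run without `--mpl`, the image comparison is not performed and the test only confirms that the plotting code executes.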

## Putting it all together

A well-tested evaluation module has all four:

```
my_eval/
  metrics.py
tests/
  test_metrics_unit.py     # reference-equivalence vs sklearn
  test_metrics_props.py    # hypothesis @given invariants
  test_outputs_golden.py   # JSON schema validation on serialized results
  test_plots_visual.py     # pytest-mpl on plotting functions
  baseline/
    test_*.png             # pytest-mpl baselines
```

This is the structure eval-toolkit's own `tests/` follows. Consumer
projects copying this pattern get the same defensive depth:
math-invariant correctness, behavioral parity with canonical libraries,
serialization stability, and visual stability.

(testing-pitfalls)=
## Pitfalls / Common mistakes
- **Property tests with no actual invariants.** "PR-AUC is a number" is
  not a property. The property tests in
  `tests/test_metrics_props.py` are templates of the *real* invariants
  (inversion, boundedness, monotone-invariance) — pattern-match those.
- **Reference tests with zero tolerance.** sklearn / scipy versions
  drift; allow `pytest.approx(..., rel=1e-6)`-level tolerance, not bit
  equality. Bit equality breaks across NumPy / BLAS updates.
- **Golden snapshots that include non-deterministic fields.** A
  `RunResult` snapshot that includes `git_sha` or `wall_clock_seconds`
  fails on every commit. Use schema validation (allows additive
  fields) or strip non-deterministic fields before snapshotting.
- **Property tests that hit external services.** Hypothesis generates
  hundreds of inputs per run; if your test calls an LLM API per
  case, the bill is real. Use the smoke-test pattern (10 hand-picked
  inputs) for expensive predictors.
- **Visual regression baselines committed without review.** A
  pytest-mpl failure can be silenced by re-generating the baseline.
  Treat baseline regeneration as a code change subject to PR review,
  not a "fix the test" reflex.

## Further reading

- Hypothesis docs: https://hypothesis.readthedocs.io/
- pytest-mpl docs: https://pytest-mpl.readthedocs.io/
- jsonschema docs: https://python-jsonschema.readthedocs.io/
- *Property-Based Testing for Stats Code,* PyData 2018 (Christopher
  Armstrong) — taxonomy of stats-code-relevant properties.
- The v0.3 research audit (`docs/v0.3_research_audit.md`) — catalogs
  the toolkit's own kernel-level reference-equivalence gaps.

See also: [reproducibility.md](reproducibility.md) (golden tests need
deterministic outputs).