Claims and Gates

This chapter covers eval_toolkit.claims — the layer that turns a RunResult into a release-time go/no-go decision by evaluating named gates against the result payload and (optionally) the manifest. The toolkit ships generic gates; consumers write their own when domain logic requires it.

Background. This chapter assumes you’ve already produced a RunResult (see getting-started) and read the broader framing in evidence.md (source roles, threshold transfer, claim mode vs exploratory mode). The piece this chapter pins down is the contract between a gate and a ClaimReport — exactly what passing / failing means, how exceptions are normalized, and what a has_failures() verdict commits to.

When to reach for claims

Use claims when you need a verdict, not a diagnostic:

“Is this release ready to ship?”
“Does this run satisfy the preregistered evidence requirements?”
“Should the report renderer print ‘we claim X’ or ‘we cannot claim X’?”

Don’t use claims for exploratory metrics (“how did model A do?”), ablations, or score-distribution audits. Those produce numbers; claims produce booleans.

The data model

A ClaimSpec bundles a claim’s name with one or more EvidenceGates. Each gate runs a check(result, manifest) -> GateResult. The full ClaimReport collects every gate’s verdict and exposes has_failures() for the go/no-go decision.

from eval_toolkit.claims import (
    ClaimSpec,
    EvidenceGate,
    GateResult,
    evaluate_claims,
    required_metric_gate,
    minimum_slice_size_gate,
)

result = {
    "run_id": "demo-run",
    "by_slice": {
        "dev": {
            "n": 200,
            "n_positive": 100,
            "by_scorer": {"my_model": {"pr_auc": 0.82}},
        },
    },
}

spec = ClaimSpec(
    name="dev_pr_auc_supported",
    gates=(
        required_metric_gate("dev", "my_model", "pr_auc"),
        minimum_slice_size_gate("dev", min_n=100, min_positive=20, min_negative=20),
    ),
)

report = evaluate_claims(result, [spec])
assert report.has_failures() is False
assert all(g.passed for g in report.claims["dev_pr_auc_supported"])

GateResult is JSON-serializable and carries the gate’s name, a passed: bool, the severity ("error", "warning", or "info"), a human-readable message, and a free-form evidence dict useful for report-renderer post-processing.

ClaimReport.to_dict() returns the full per-gate tree plus the aggregated has_failures field. Store that dict in your results.json under claim_report (the RunResult field added in v0.9), or attach it with with_claim_report(...).
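
If the results file already exists on disk, the attachment step can be sketched at the JSON level. This is a minimal illustration using only the standard library: the attach_claim_report helper is hypothetical, and report_dict is a placeholder for whatever ClaimReport.to_dict() actually returns, whose exact shape is an assumption here.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def attach_claim_report(results_path: Path, report_dict: dict[str, Any]) -> dict[str, Any]:
    """Store report_dict under "claim_report" in the JSON file and return the payload."""
    payload = json.loads(results_path.read_text(encoding="utf-8"))
    # Build a new dict instead of mutating the loaded payload in place.
    updated = {**payload, "claim_report": report_dict}
    results_path.write_text(json.dumps(updated, indent=2), encoding="utf-8")
    return updated


# Placeholder for the output of ClaimReport.to_dict(); the real shape may differ.
report_dict = {"has_failures": False, "claims": {"dev_pr_auc_supported": []}}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "results.json"
    path.write_text(json.dumps({"run_id": "demo-run"}), encoding="utf-8")
    attach_claim_report(path, report_dict)
    reloaded = json.loads(path.read_text(encoding="utf-8"))
```

Editing the file directly bypasses any validation the RunResult type performs, so prefer with_claim_report(...) when you still hold the RunResult object.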

Worked walkthrough

Start from a realistic RunResult shape: two scorers evaluated on one slice, plus a paired difference between them. The claim is “the new model beats the baseline on the dev slice with statistically supported PR-AUC.”

from eval_toolkit.claims import (
    ClaimSpec,
    evaluate_claims,
    metric_threshold_gate,
    no_scorer_errors_gate,
    paired_diff_present_gate,
    required_scorer_gate,
)

result = {
    "by_slice": {
        "dev": {
            "n": 500,
            "n_positive": 250,
            "by_scorer": {
                "baseline": {"pr_auc": 0.65},
                "candidate": {"pr_auc": 0.82},
            },
            "paired_diffs": {
                "candidate_minus_baseline": {
                    "delta": 0.17,
                    "ci_95": [0.08, 0.26],
                }
            },
        }
    }
}

spec = ClaimSpec(
    name="candidate_beats_baseline_on_dev",
    gates=(
        required_scorer_gate("dev", "candidate"),
        required_scorer_gate("dev", "baseline"),
        metric_threshold_gate(
            "dev", "candidate", "pr_auc", op=">=", threshold=0.80
        ),
        paired_diff_present_gate("dev", "candidate_minus_baseline"),
        no_scorer_errors_gate(),
    ),
)

report = evaluate_claims(result, [spec])
assert report.has_failures() is False
gate_names = [g.name for g in report.claims["candidate_beats_baseline_on_dev"]]
assert "metric_threshold:dev:candidate:pr_auc" in gate_names

Each gate is independent; failures don’t short-circuit. That’s deliberate: the report carries every verdict so the renderer can show “3 of 5 gates passed; here is what’s missing” rather than “stopped at gate 2.”
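
Because every gate reports a verdict, a renderer can summarize a claim without re-running anything. The sketch below illustrates that idea with a stand-in Verdict dataclass rather than the toolkit's GateResult. The summarize helper is hypothetical, and every gate name other than metric_threshold:dev:candidate:pr_auc is invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """Stand-in for a gate verdict; not the toolkit's GateResult."""

    name: str
    passed: bool
    message: str = ""


def summarize(verdicts: list[Verdict]) -> str:
    """Build a one-line summary that names every failed gate."""
    failed = [v for v in verdicts if not v.passed]
    n_passed = len(verdicts) - len(failed)
    line = f"{n_passed} of {len(verdicts)} gates passed"
    if failed:
        line += "; missing: " + ", ".join(v.name for v in failed)
    return line


# Hypothetical verdicts for a five-gate claim in which two gates failed.
verdicts = [
    Verdict("required_scorer:dev:candidate", True),
    Verdict("required_scorer:dev:baseline", True),
    Verdict("metric_threshold:dev:candidate:pr_auc", True),
    Verdict("paired_diff_present:dev:candidate_minus_baseline", False, "diff not found"),
    Verdict("no_scorer_errors", False, "1 scorer error recorded"),
]
print(summarize(verdicts))
```

A short-circuiting evaluator would only ever report the first failure, so the second missing item would surface only after the first was fixed and the claim re-evaluated.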

Exception handling contract

EvidenceGate.evaluate catches a specific set of runtime/data exceptions and converts them to typed failures:

KeyError, ValueError, TypeError, RuntimeError, AttributeError, LookupError.

Any of those raised inside a gate’s check becomes a GateResult(passed=False, message=f"{type(exc).__name__}: {exc}"). Other exceptions — NameError, AssertionError, ImportError, KeyboardInterrupt — propagate. The rationale: data-shape errors are expected (a gate looks up a metric path that doesn’t exist) and should record cleanly; implementer bugs are unexpected and should crash loudly so they don’t get silently coerced into “gate failed” noise.

from collections.abc import Mapping
from typing import Any
from eval_toolkit.claims import ClaimSpec, EvidenceGate, GateResult, evaluate_claims


def _buggy_check(result: Mapping[str, Any], manifest: Mapping[str, Any] | None) -> GateResult:
    # Typo: meant `result["by_slice"]`. KeyError is normalized to a typed failure.
    _ = result["by_slise"]
    return GateResult(name="buggy", passed=True)


gate = EvidenceGate(name="buggy", check=_buggy_check)
spec = ClaimSpec(name="demo", gates=(gate,))
report = evaluate_claims({"by_slice": {}}, [spec])
assert report.has_failures() is True
assert report.claims["demo"][0].message.startswith("KeyError:")

Compare to an AssertionError (an implementer-bug class), which is not caught and propagates:

import pytest
from collections.abc import Mapping
from typing import Any
from eval_toolkit.claims import ClaimSpec, EvidenceGate, GateResult, evaluate_claims


def _asserts(result: Mapping[str, Any], manifest: Mapping[str, Any] | None) -> GateResult:
    # Raise explicitly: a bare `assert False` is stripped when Python runs with -O.
    raise AssertionError("this is a bug, not a missing metric")


gate = EvidenceGate(name="asserts", check=_asserts)
spec = ClaimSpec(name="demo", gates=(gate,))
with pytest.raises(AssertionError):
    evaluate_claims({}, [spec])
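
The contract in this section can be modeled in a few lines. The following is a sketch of the behavior described above, not the toolkit's source: run_check and its dict-shaped return value are hypothetical stand-ins for EvidenceGate.evaluate and GateResult.

```python
from collections.abc import Callable, Mapping
from typing import Any

# Exception types this chapter lists as normalized into failures.
NORMALIZED = (KeyError, ValueError, TypeError, RuntimeError, AttributeError, LookupError)


def run_check(check: Callable[[Mapping[str, Any]], bool], result: Mapping[str, Any]) -> dict[str, Any]:
    """Run one check and convert the listed exceptions into a failed verdict."""
    try:
        passed = check(result)
    except NORMALIZED as exc:
        return {"passed": False, "message": f"{type(exc).__name__}: {exc}"}
    return {"passed": bool(passed), "message": ""}


# A data-shape error is recorded as a failure instead of crashing.
missing = run_check(lambda r: r["by_slise"] is not None, {"by_slice": {}})

# IndexError subclasses LookupError, so it is normalized as well.
out_of_range = run_check(lambda r: r["slices"][3] > 0, {"slices": []})


def _implementer_bug(result: Mapping[str, Any]) -> bool:
    raise AssertionError("this is a bug, not a missing metric")


# AssertionError is not in the tuple, so it escapes run_check.
try:
    run_check(_implementer_bug, {})
    propagated = False
except AssertionError:
    propagated = True
```

In Python, KeyError and IndexError both derive from LookupError, so an except clause over this tuple also catches index lookups even though IndexError is not named in the list.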

Severity policy

A gate’s severity controls whether failure flips has_failures():

error (default): a failure flips has_failures() to True.
warning: a failure flips has_failures() only when you call report.has_failures(include_warnings=True).
info: never flips has_failures(). Use it for purely informational gates that produce evidence the renderer reads but that do not gate release.

Use warning for soft gates whose failure should surface in the report but not block release. Reserve error for the hard preconditions of the claim.

from eval_toolkit.claims import (
    ClaimReport,
    GateResult,
)

# One claim with a passing error-severity gate and a failing warning-severity gate.
report = ClaimReport(
    claims={
        "demo": [
            GateResult(name="hard_gate", passed=True, severity="error"),
            GateResult(name="soft_gate", passed=False, severity="warning"),
        ]
    }
)
assert report.has_failures() is False
assert report.has_failures(include_warnings=True) is True
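
The three severity rules reduce to a single predicate. The sketch below models it with a stand-in Verdict dataclass and a free function. It illustrates the policy as documented here and is not the ClaimReport implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """Stand-in for a gate verdict; not the toolkit's GateResult."""

    name: str
    passed: bool
    severity: str = "error"  # one of "error", "warning", "info"


def has_failures(verdicts: list[Verdict], include_warnings: bool = False) -> bool:
    """Return True when any failed verdict has a blocking severity."""
    blocking = {"error", "warning"} if include_warnings else {"error"}
    return any(not v.passed and v.severity in blocking for v in verdicts)


verdicts = [
    Verdict("hard_gate", passed=True, severity="error"),
    Verdict("soft_gate", passed=False, severity="warning"),
    Verdict("note_gate", passed=False, severity="info"),
]

default_verdict = has_failures(verdicts)  # warnings and info are ignored
strict_verdict = has_failures(verdicts, include_warnings=True)  # the warning now counts
info_only = has_failures([verdicts[2]], include_warnings=True)  # info never counts
```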

Pitfalls / Common mistakes

Do attach the report. evaluate_claims doesn’t mutate the RunResult for you. Either set RunResult.claim_report = report.to_dict() directly or use eval_toolkit.harness.with_claim_report which returns a new RunResult with the field populated.

Don’t write gates that cannot fail. A gate that always returns passed=True (e.g., lambda r, m: GateResult(name="g", passed=True)) can never flip has_failures(), so it adds no evidence to the claim. Each gate should encode a checkable precondition: “this metric exists,” “this slice is at least N rows,” “this diff CI excludes zero.” Walk through every gate and ask: what would make this fail? If the answer is “nothing realistic,” delete it.
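
As an example of a precondition that can realistically fail, here is a sketch of the “this diff CI excludes zero” check, written as a plain function over the result payload. The diff_ci_excludes_zero name and its tuple return value are hypothetical, and the paired_diffs / ci_95 layout is taken from the walkthrough payload above. To use it with the toolkit you would wrap it in a check that returns a GateResult.

```python
from collections.abc import Mapping
from typing import Any


def diff_ci_excludes_zero(
    result: Mapping[str, Any], slice_name: str, diff_name: str
) -> tuple[bool, str]:
    """Pass only when the whole 95% CI lies strictly above or strictly below zero."""
    # A missing slice or diff raises KeyError here rather than passing silently.
    low, high = result["by_slice"][slice_name]["paired_diffs"][diff_name]["ci_95"]
    passed = low > 0 or high < 0
    return passed, f"ci_95=[{low}, {high}]"


def _payload(ci: list[float]) -> dict[str, Any]:
    """Build a minimal payload in the walkthrough's shape."""
    return {"by_slice": {"dev": {"paired_diffs": {"candidate_minus_baseline": {"ci_95": ci}}}}}


supported = diff_ci_excludes_zero(_payload([0.08, 0.26]), "dev", "candidate_minus_baseline")
unsupported = diff_ci_excludes_zero(_payload([-0.02, 0.26]), "dev", "candidate_minus_baseline")
```

The second payload is the answer to “what would make this fail?”: an interval that straddles zero.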

Don’t lean on warnings for go/no-go. A warning-severity failure is surfaced in the report but does not count as a failure by default. If the absence of something should block release, use error.

Do separate exploratory metrics from claim gates. RunResult carries all the metrics — bootstrap CIs, slice-stratified PR-AUC, calibration error. Only the subset of those numbers that’s preregistered as a claim precondition belongs in a ClaimSpec. The rest is exploratory and lives in the result payload, not in the gates.

Watch for KeyError-as-failure masking real bugs. Because KeyError is in the catch list, a gate that fails because you typed a metric path wrong looks identical to a gate that fails because the metric is genuinely missing. Log gate failures during development and read the message field — it includes the exception type prefix.
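
One way to tell the two cases apart during development is to compare the missing key against the keys that do exist. The helper below is a hypothetical debugging aid, not part of the toolkit. It uses difflib.get_close_matches from the standard library to flag near-miss spellings.

```python
import difflib
from collections.abc import Mapping
from typing import Any


def explain_missing_key(container: Mapping[str, Any], key: str) -> str:
    """Describe a failed lookup, flagging near-miss spellings of existing keys."""
    if key in container:
        return f"{key!r} is present"
    close = difflib.get_close_matches(key, list(container), n=1)
    if close:
        # A close match suggests a typo in the gate, not missing data.
        return f"{key!r} not found; did you mean {close[0]!r}?"
    return f"{key!r} not found; available keys: {sorted(container)}"


typo_hint = explain_missing_key({"by_slice": {}}, "by_slise")
missing_hint = explain_missing_key({"by_slice": {}}, "pr_auc")
```

A “did you mean” hint points at a bug in the gate's path, while a plain “not found” message points at data that is genuinely absent from the run.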