# Claims and Gates

This chapter covers `eval_toolkit.claims`, the layer that turns a
`RunResult` into a **release-time go/no-go decision** by evaluating
named gates against the result payload and (optionally) the manifest.
The toolkit ships generic gates; consumers write their own when domain
logic requires it.

> **Background.** This chapter assumes you've already produced a
> `RunResult` (see [getting-started](../getting-started.md)) and read
> the broader framing in [evidence.md](evidence.md) (source roles,
> threshold transfer, claim mode vs exploratory mode). The piece this
> chapter pins down is the *contract* between a gate and a
> `ClaimReport`: exactly what passing / failing means, how exceptions
> are normalized, and what a `has_failures()` verdict commits to.

(when-to-use)=
## When to reach for claims

Use claims when you need a verdict, not a diagnostic:

- "Is this release ready to ship?"
- "Does this run satisfy the preregistered evidence requirements?"
- "Should the report renderer print 'we claim X' or 'we cannot claim X'?"

Don't use claims for exploratory metrics ("how did model A do?"),
ablations, or score-distribution audits. Those produce numbers; claims
produce booleans.

(claims-data-model)=
## The data model

A `ClaimSpec` bundles a claim's name with one or more `EvidenceGate`s.
Each gate runs a `check(result, manifest) -> GateResult`. The full
`ClaimReport` collects every gate's verdict and exposes
`has_failures()` for the go/no-go decision.

```python
from eval_toolkit.claims import (
    ClaimSpec,
    EvidenceGate,
    GateResult,
    evaluate_claims,
    required_metric_gate,
    minimum_slice_size_gate,
)

result = {
    "run_id": "demo-run",
    "by_slice": {
        "dev": {
            "n": 200,
            "n_positive": 100,
            "by_scorer": {"my_model": {"pr_auc": 0.82}},
        },
    },
}

spec = ClaimSpec(
    name="dev_pr_auc_supported",
    gates=(
        required_metric_gate("dev", "my_model", "pr_auc"),
        minimum_slice_size_gate("dev", min_n=100, min_positive=20, min_negative=20),
    ),
)

report = evaluate_claims(result, [spec])
assert report.has_failures() is False
assert all(g.passed for g in report.claims["dev_pr_auc_supported"])
```

`GateResult` is JSON-serializable and carries the gate's `name`, a
`passed: bool`, the `severity` (`"error"`, `"warning"`, or `"info"`), a
human-readable `message`, and a free-form `evidence` dict useful for
report-renderer post-processing. `ClaimReport.to_dict()` writes the
full per-gate tree plus the aggregated `has_failures` field. Drop that
into your `results.json` under `claim_report` (the `RunResult` field
added in v0.9) or pass it through `with_claim_report(...)`.

(claims-worked-walkthrough)=
## Worked walkthrough

Start from a real `RunResult` shape: two scorers, two slices, mixed
metrics. The claim is "the new model beats the baseline on the dev
slice with statistically supported PR-AUC."

```python
from eval_toolkit.claims import (
    ClaimSpec,
    evaluate_claims,
    metric_threshold_gate,
    no_scorer_errors_gate,
    paired_diff_present_gate,
    required_scorer_gate,
)

result = {
    "by_slice": {
        "dev": {
            "n": 500,
            "n_positive": 250,
            "by_scorer": {
                "baseline": {"pr_auc": 0.65},
                "candidate": {"pr_auc": 0.82},
            },
            "paired_diffs": {
                "candidate_minus_baseline": {
                    "delta": 0.17,
                    "ci_95": [0.08, 0.26],
                }
            },
        }
    }
}

spec = ClaimSpec(
    name="candidate_beats_baseline_on_dev",
    gates=(
        required_scorer_gate("dev", "candidate"),
        required_scorer_gate("dev", "baseline"),
        metric_threshold_gate(
            "dev", "candidate", "pr_auc", op=">=", threshold=0.80
        ),
        paired_diff_present_gate("dev", "candidate_minus_baseline"),
        no_scorer_errors_gate(),
    ),
)

report = evaluate_claims(result, [spec])
assert report.has_failures() is False
gate_names = [g.name for g in report.claims["candidate_beats_baseline_on_dev"]]
assert "metric_threshold:dev:candidate:pr_auc" in gate_names
```

Each gate is independent; failures don't short-circuit. That's
deliberate: the report carries every verdict so the renderer can show
"3 of 5 gates passed; here is what's missing" rather than "stopped at
gate 2."

(exception-contract)=
## Exception handling contract

`EvidenceGate.evaluate` catches a *specific* set of runtime/data
exceptions and converts them to typed failures: `KeyError`,
`ValueError`, `TypeError`, `RuntimeError`, `AttributeError`,
`LookupError`. Any of those raised inside a gate's `check` becomes a
`GateResult(passed=False, message=f"{type(exc).__name__}: {exc}")`.

Other exceptions (`NameError`, `AssertionError`, `ImportError`,
`KeyboardInterrupt`) propagate. The rationale: data-shape errors are
expected (a gate looks up a metric path that doesn't exist) and should
record cleanly; implementer bugs are unexpected and should crash loudly
so they don't get silently coerced into "gate failed" noise.

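
The normalization rule can be sketched without the toolkit installed. The block below is a minimal stand-in, not the library's implementation: `SketchGateResult`, `NORMALIZED`, and `run_check` are hypothetical names introduced only for illustration, and the checks take a single `result` argument for brevity. It assumes nothing beyond the exception list and message format described above.

```python
from collections.abc import Callable, Mapping
from dataclasses import dataclass
from typing import Any


# Hypothetical stand-in for illustration; not eval_toolkit's GateResult.
@dataclass(frozen=True)
class SketchGateResult:
    name: str
    passed: bool
    message: str = ""


# The data-shape exception classes named in the contract above.
NORMALIZED = (
    KeyError, ValueError, TypeError, RuntimeError, AttributeError, LookupError,
)

Check = Callable[[Mapping[str, Any]], SketchGateResult]


def run_check(name: str, check: Check, result: Mapping[str, Any]) -> SketchGateResult:
    """Run a check; convert listed exceptions to a failed result."""
    try:
        return check(result)
    except NORMALIZED as exc:
        # Same message shape as the contract: "<ExceptionType>: <text>".
        return SketchGateResult(
            name=name, passed=False, message=f"{type(exc).__name__}: {exc}"
        )


def missing_metric(result: Mapping[str, Any]) -> SketchGateResult:
    # Raises KeyError when the payload lacks the path.
    value = result["by_slice"]["dev"]["pr_auc"]
    return SketchGateResult(name="missing_metric", passed=value >= 0.8)


def implementer_bug(result: Mapping[str, Any]) -> SketchGateResult:
    # Calls a name that was never defined, so this raises NameError.
    return undefined_helper(result)  # noqa: F821


# A data-shape error is recorded as a failed result.
normalized = run_check("missing_metric", missing_metric, {})

# An implementer bug is not in the catch list, so it escapes run_check.
try:
    run_check("implementer_bug", implementer_bug, {})
except NameError:
    propagated = True
else:
    propagated = False
```

In this sketch `normalized` is a failed result whose message starts with the exception type, while the `NameError` escapes `run_check` untouched. With the toolkit itself, the same two behaviors look like the following examples.
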
```python
from collections.abc import Mapping
from typing import Any

from eval_toolkit.claims import ClaimSpec, EvidenceGate, GateResult, evaluate_claims


def _buggy_check(result: Mapping[str, Any], manifest: Mapping[str, Any] | None) -> GateResult:
    # Typo: meant `result["by_slice"]`. KeyError is normalized to a typed failure.
    _ = result["by_slise"]
    return GateResult(name="buggy", passed=True)


gate = EvidenceGate(name="buggy", check=_buggy_check)
spec = ClaimSpec(name="demo", gates=(gate,))
report = evaluate_claims({"by_slice": {}}, [spec])
assert report.has_failures() is True
assert report.claims["demo"][0].message.startswith("KeyError:")
```

Compare to an `AssertionError` (an implementer-bug class), which is
*not* caught and propagates:

```python
from collections.abc import Mapping
from typing import Any

import pytest

from eval_toolkit.claims import ClaimSpec, EvidenceGate, GateResult, evaluate_claims


def _asserts(result: Mapping[str, Any], manifest: Mapping[str, Any] | None) -> GateResult:
    assert False, "this is a bug, not a missing metric"


gate = EvidenceGate(name="asserts", check=_asserts)
spec = ClaimSpec(name="demo", gates=(gate,))
with pytest.raises(AssertionError):
    evaluate_claims({}, [spec])
```

(severity-policy)=
## Severity policy

A gate's `severity` controls whether failure flips `has_failures()`:

- `error` (default): failure flips `has_failures()` to True.
- `warning`: failure does **not** flip `has_failures()` unless
  `report.has_failures(include_warnings=True)` is explicitly requested.
- `info`: never flips `has_failures()`. Use for purely informational
  gates that produce evidence the renderer reads but don't gate release.

Use `warning` for soft gates whose failure should *surface* in the
report but not block release. Reserve `error` for the hard
preconditions of the claim.

```python
from eval_toolkit.claims import (
    ClaimReport,
    GateResult,
)

# A claim with one error-passing and one warning-failing gate.
report = ClaimReport(
    claims={
        "demo": [
            GateResult(name="hard_gate", passed=True, severity="error"),
            GateResult(name="soft_gate", passed=False, severity="warning"),
        ]
    }
)
assert report.has_failures() is False
assert report.has_failures(include_warnings=True) is True
```

(claims-pitfalls)=
## Pitfalls / Common mistakes

**Do attach the report.** `evaluate_claims` doesn't mutate the
`RunResult` for you. Either set
`RunResult.claim_report = report.to_dict()` directly or use
`eval_toolkit.harness.with_claim_report`, which returns a new
`RunResult` with the field populated.

**Don't write overly broad gates.** A gate that always returns
`passed=True` (e.g., `lambda r, m: GateResult(name="g", passed=True)`)
can never flip `has_failures()`, so it adds a passing line to the
report without checking anything. Each gate should encode a checkable
precondition: "this metric exists," "this slice is at least N rows,"
"this diff CI excludes zero." Walk through every gate and ask: *what
would make this fail?* If the answer is "nothing realistic," delete it.

**Don't lean on warnings for go/no-go.** A failed `warning` gate is
surfaced in the report but does not count as a failure by default. If
the absence of something should *block* release, use `error`.

**Do separate exploratory metrics from claim gates.** `RunResult`
carries all the metrics: bootstrap CIs, slice-stratified PR-AUC,
calibration error. Only the *subset* of those numbers that's
preregistered as a claim precondition belongs in a `ClaimSpec`. The
rest is exploratory and lives in the result payload, not in the gates.

**Watch for KeyError-as-failure masking real bugs.** Because `KeyError`
is in the catch list, a gate that fails because *you typed a metric
path wrong* looks identical to a gate that fails because the metric is
genuinely missing. Log gate failures during development and read the
`message` field; it includes the exception type prefix.

## See also

- [evidence.md](evidence.md): broader framing of source roles,
  threshold transfer, and the claim-mode contract.
- [artifacts.md](artifacts.md): the `PredictionArtifactRef` /
  `MetricState` data layer that claim gates read from.
- [versioning.md](versioning.md#schema-evolution): schema-evolution
  policy for the `claim_report` field on `RunResult`.
- [../extending.md](../extending.md): how to write a custom gate that
  follows the `EvidenceGate` contract.