---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
kernelspec:
  display_name: Python 3
  language: python
  name: python3
---

# Worked example: claims + evidence gates

> **What this shows.** Compose `EvidenceGate`s to drive a binary
> release decision ("ship this model" vs "block on insufficient
> evidence"). Each gate is a single check; `evaluate_claims` runs them
> against a `RunResult` and emits a structured `ClaimReport`.
>
> **Runtime:** ~1 s. Pure-numpy core; no optional deps.

## Setup

```{code-cell}
import numpy as np
import pandas as pd

from eval_toolkit import (
    EvalSlice,
    evaluate,
    set_global_seeds,
    ClaimSpec,
    evaluate_claims,
    metric_threshold_gate,
    minimum_slice_size_gate,
)

set_global_seeds(42)
```

## Build a small `RunResult` to test against

(See [`evaluate_harness.md`](evaluate_harness.md) for the full
pipeline; here we just need a `RunResult` to run claims on.)

```{code-cell}
rng = np.random.default_rng(42)
n = 100
y = np.concatenate([np.zeros(50), np.ones(50)]).astype(int)
rng.shuffle(y)
df = pd.DataFrame({"text": [f"r{i}" for i in range(n)], "label": y})
parent = EvalSlice(name="test", df=df)


class _Stub:
    def predict_proba(self, X):
        # Scores follow the parity of the row index parsed from the text
        # ("r7" -> 1), assuming X yields the "r{i}" strings built above.
        # Because `y` was shuffled, that parity is independent of the true
        # label, so this stub scores close to chance.
        parity = np.array([int(x[1:]) % 2 for x in X])
        return np.clip(
            0.5 + 0.3 * (parity - 0.5) + rng.normal(0, 0.1, size=len(X)),
            0.0,
            1.0,
        )


result = evaluate(
    scorers={"model_a": _Stub()},
    slices=[parent],
    run_id="claim_example",
    n_resamples=50,
    seed=42,
)
```

## Compose a `ClaimSpec` from gates

A `ClaimSpec` is a named bundle of gates. All gates must pass for the
claim to pass. The toolkit ships a library of reusable gates; you can
also write custom ones (any callable matching the `EvidenceGate`
Protocol).
Two common built-in gates:

```{code-cell}
claim = ClaimSpec(
    name="model_a_releasable",
    gates=(
        # Quality bar: PR-AUC on the test slice must be ≥ 0.60
        metric_threshold_gate(
            slice_name="test",
            scorer_name="model_a",
            metric_path="pr_auc",
            op=">=",
            threshold=0.60,
        ),
        # Statistical-power floor: don't claim anything on a tiny slice
        minimum_slice_size_gate(slice_name="test", min_n=30),
    ),
)
```

## Run the claim

`evaluate_claims(result, [claim])` runs every gate against the
`RunResult` and produces a `ClaimReport` whose `claims` mapping goes
from claim name to a list of `GateResult`s:

```{code-cell}
report = evaluate_claims(result, [claim])
for gate_result in report.claims["model_a_releasable"]:
    status = "PASS" if gate_result.passed else "FAIL"
    print(f" [{status}] {gate_result.name}: {gate_result.message}")
print(f"has_failures: {report.has_failures()}")
```

In this synthetic example the noisy stub's `pr_auc` is around 0.51,
close to the 0.50 chance baseline for a balanced slice and not
significantly better than random at this sample size, so the threshold
gate FAILS. The minimum-size gate PASSES (n=100 ≥ 30). Because every
gate must pass, the claim as a whole fails.

## Release decision: every claim must pass

The pattern for shipping a release: every `ClaimSpec` must pass.
`report.has_failures()` returns `True` if any gate failed:

```{code-cell}
if report.has_failures():
    print("BLOCK: at least one gate failed; see evidence above")
else:
    print("RELEASE: all gates passed")
```

`GateResult.evidence` carries structured detail (the actual value, the
threshold, the metric path) so a downstream tool can render the
failure breakdown in CI logs or a release dashboard.

## Pre-1.0 design note

The `EvidenceGate` Protocol is intentionally simple: a callable with
the shape `(RunResult, ...) -> GateResult`. Custom gates plug into
`ClaimSpec.gates` alongside the toolkit-provided ones. This matches the
[ACL Rolling Review Responsible NLP Research checklist](https://aclrollingreview.org/responsibleNLPresearch/)
pattern of structured evidence: each claim is gated on specific,
auditable conditions.
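As a rough illustration of the "callable in, result out" shape described above, here is a minimal, self-contained sketch of a custom gate. It deliberately does not import `eval_toolkit`: `SketchGateResult`, the `Metrics` mapping, `metric_ceiling_gate`, and the `brier` metric with its value are hypothetical stand-ins invented for this example. Only the field names `name`, `passed`, `message`, and `evidence` come from the usage shown earlier; the real `RunResult` and `GateResult` types may be constructed differently.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Mapping

# Hypothetical stand-in for a RunResult: (slice, scorer) -> metric values.
# The real eval_toolkit RunResult is NOT assumed to look like this.
Metrics = Mapping[tuple[str, str], Mapping[str, float]]


@dataclass(frozen=True)
class SketchGateResult:
    """Stand-in for GateResult, mirroring the fields used in this page."""

    name: str
    passed: bool
    message: str
    evidence: dict[str, Any] = field(default_factory=dict)


def metric_ceiling_gate(
    slice_name: str, scorer_name: str, metric: str, ceiling: float
) -> Callable[[Metrics], SketchGateResult]:
    """Build a gate that passes when `metric` is at or below `ceiling`."""
    gate_name = f"{scorer_name}.{metric}<={ceiling} on {slice_name}"

    def gate(run: Metrics) -> SketchGateResult:
        value = run.get((slice_name, scorer_name), {}).get(metric)
        evidence = {"metric": metric, "threshold": ceiling, "value": value}
        if value is None:
            # Missing evidence is reported as a failure, never a silent pass.
            return SketchGateResult(gate_name, False, "metric not found", evidence)
        message = f"{metric}={value:.3f} (ceiling {ceiling})"
        return SketchGateResult(gate_name, value <= ceiling, message, evidence)

    return gate


# Illustrative numbers only; they are not output from the toolkit.
run = {("test", "model_a"): {"brier": 0.21, "pr_auc": 0.51}}
gate_ok = metric_ceiling_gate("test", "model_a", "brier", 0.25)(run)
gate_missing = metric_ceiling_gate("test", "model_b", "brier", 0.25)(run)
print(gate_ok.passed, gate_ok.message)
print(gate_missing.passed, gate_missing.message)
```

The factory-plus-closure layout keeps the gate's configuration (slice, scorer, threshold) separate from the run it is evaluated on, so one configured gate can be reused across runs. Treating a missing metric as a failure is a design choice of this sketch, not documented toolkit behavior.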
For multi-comparison correction (Bonferroni / BH-FDR) when running
many gates over many slices, see issue
[#1](https://github.com/brandon-behring/eval-toolkit/issues/1)
(planned).

## See also

- [`claims.py` reference](../api/claims.md): full gate library
  (`required_scorer_gate`, `required_slice_gate`,
  `paired_diff_present_gate`, `no_leakage_errors_gate`,
  `headline_present_gate`, `low_fpr_feasibility_gate`, etc.).
- [`evidence.py` reference](../api/evidence.md): `EvidenceAxis`,
  `AggregateEvidence` for typed claim aggregation.
- [Evaluate harness example](evaluate_harness.md): the upstream
  `RunResult` that gets fed into `evaluate_claims`.