Worked example: claims + evidence gates
What this shows. Compose EvidenceGates to drive a binary release decision (“ship this model” vs. “block on insufficient evidence”). Each gate is a single check; evaluate_claims runs them against a RunResult and emits a structured ClaimReport.
Runtime: ~1 s. Pure-numpy core; no optional deps.
Setup
import numpy as np
import pandas as pd
from eval_toolkit import (
    EvalSlice, evaluate, set_global_seeds,
    ClaimSpec, evaluate_claims,
    metric_threshold_gate,
    minimum_slice_size_gate,
)
set_global_seeds(42)
Build a small RunResult to test against
(See evaluate_harness.md for the full pipeline; here we just need a RunResult to run claims on.)
rng = np.random.default_rng(42)
n = 100
y = np.concatenate([np.zeros(50), np.ones(50)]).astype(int)
rng.shuffle(y)
df = pd.DataFrame({"text": [f"r{i}" for i in range(n)], "label": y})
parent = EvalSlice(name="test", df=df)
class _Stub:
    def predict_proba(self, X):
        # Scores follow the parity of the row index parsed from the text
        # ("r0", "r1", ...), not the shuffled label, so they carry almost
        # no information about `label`.
        parity = np.array([int(x[1:]) % 2 for x in X])
        return np.clip(
            0.5 + 0.3 * (parity - 0.5) + rng.normal(0, 0.1, size=len(X)),
            0.0, 1.0,
        )
result = evaluate(
    scorers={"model_a": _Stub()},
    slices=[parent],
    run_id="claim_example",
    n_resamples=50,
    seed=42,
)
Compose a ClaimSpec from gates
A ClaimSpec is a named bundle of gates. All gates must pass for the
claim to pass. The toolkit ships a library of reusable gates; you can
also write custom ones (any callable matching the EvidenceGate
Protocol). Two common ones:
claim = ClaimSpec(
    name="model_a_releasable",
    gates=(
        # Quality bar: PR-AUC on the test slice must be ≥ 0.60
        metric_threshold_gate(
            slice_name="test",
            scorer_name="model_a",
            metric_path="pr_auc",
            op=">=",
            threshold=0.60,
        ),
        # Statistical-power floor: don't claim anything on a tiny slice
        minimum_slice_size_gate(slice_name="test", min_n=30),
    ),
)
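To make the two gate types concrete, here is a minimal sketch of how a threshold check and a slice-size check could be written as plain functions, and how "all gates must pass" reduces to a conjunction. Every name in it (SketchGateResult, threshold_check, slice_size_check, claim_passes) is a hypothetical stand-in, and the inputs are hand-picked; this is not eval_toolkit's implementation.

```python
import operator
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SketchGateResult:
    """Hypothetical stand-in for the toolkit's GateResult."""
    name: str
    passed: bool
    message: str
    evidence: dict = field(default_factory=dict)


# Comparison strings such as op=">=" mapped to plain Python operators.
_OPS = {">=": operator.ge, ">": operator.gt, "<=": operator.le, "<": operator.lt}


def threshold_check(name, value, op, threshold):
    """One check: compare an observed metric value against a threshold."""
    passed = bool(_OPS[op](value, threshold))
    message = "metric threshold met" if passed else "metric threshold failed"
    return SketchGateResult(name, passed, message,
                            {"value": value, "op": op, "threshold": threshold})


def slice_size_check(name, n_rows, min_n):
    """One check: refuse to conclude anything from a slice below min_n rows."""
    passed = n_rows >= min_n
    message = "slice size sufficient" if passed else "slice too small"
    return SketchGateResult(name, passed, message, {"n": n_rows, "min_n": min_n})


def claim_passes(gate_results):
    """A claim passes only when every one of its gates passed."""
    return all(r.passed for r in gate_results)


# Hand-picked inputs for illustration; they are not produced by the toolkit.
results = [
    threshold_check("pr_auc", 0.51, ">=", 0.60),
    slice_size_check("test", 100, 30),
]
print([r.passed for r in results], claim_passes(results))
```

Keeping each gate to a single comparison means a failed claim can always be traced to the specific gate that rejected it.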
Run the claim
evaluate_claims(result, [claim]) runs every gate against the
RunResult and produces a ClaimReport keyed by claim name → list of
GateResult:
report = evaluate_claims(result, [claim])
for gate_result in report.claims["model_a_releasable"]:
    status = "PASS" if gate_result.passed else "FAIL"
    print(f" [{status}] {gate_result.name}: {gate_result.message}")
print(f"has_failures: {report.has_failures()}")
[FAIL] metric_threshold:test:model_a:pr_auc: metric threshold failed
[PASS] minimum_slice_size:test: slice size sufficient
has_failures: True
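In this synthetic example the stub's scores follow the parity of the row
index rather than the shuffled label, so its pr_auc is around 0.51, close
to the 0.5 positive rate that a random scorer would reach on this balanced
slice. The threshold gate therefore fails. The minimum-size gate passes
(n=100 ≥ 30). Because one gate failed, the claim as a whole fails.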
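The near-chance score can be checked without the toolkit. The sketch below computes average precision, one common PR-AUC estimator, in plain numpy for two scorers with the same noise shape as the stub: one driven by index parity and one driven by the true label. The function average_precision, the helper noisy_scores, and the seed are assumptions made for this illustration; the toolkit's pr_auc may be computed differently, so the numbers are not expected to match its output exactly.

```python
import numpy as np


def average_precision(y_true, scores):
    """Mean precision at the ranks where a positive appears (assumes untied scores)."""
    order = np.argsort(-scores, kind="stable")  # highest score first
    hits = y_true[order]
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(precision_at_k[hits == 1].mean())


rng = np.random.default_rng(0)  # seed chosen for this sketch only
n = 100
y = np.concatenate([np.zeros(50), np.ones(50)]).astype(int)
rng.shuffle(y)


def noisy_scores(signal):
    """Same score shape as the stub: 0.35 or 0.65 plus Gaussian noise, clipped."""
    return np.clip(0.5 + 0.3 * (signal - 0.5) + rng.normal(0, 0.1, size=n), 0.0, 1.0)


ap_chance = average_precision(y, noisy_scores(np.arange(n) % 2))  # signal = index parity
ap_informed = average_precision(y, noisy_scores(y))               # signal = true label
print(f"parity-driven: {ap_chance:.2f}  label-driven: {ap_informed:.2f}")
```

A scorer whose signal is unrelated to the label lands near the positive rate of the slice, which is why a 0.60 bar is a meaningful quality threshold on a balanced slice.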
Release decision: every claim must pass
The pattern for shipping a release: every ClaimSpec must pass.
report.has_failures() returns True if any gate failed:
if report.has_failures():
    print("BLOCK: at least one gate failed — see evidence above")
else:
    print("RELEASE: all gates passed")
BLOCK: at least one gate failed — see evidence above
GateResult.evidence carries structured detail (the actual value, the
threshold, the metric path) so a downstream tool can render the failure
breakdown in CI logs or a release dashboard.
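As an illustration of that downstream use, the sketch below turns failed gates into one log line each and maps the overall outcome to a process exit code for CI. It does not import eval_toolkit: SketchGateResult is a hypothetical stand-in, the claims mapping mimics the claim name → list of results shape described earlier, and the evidence keys (value, threshold, metric_path) are assumed names, since the text lists what the evidence contains but not how it is keyed.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SketchGateResult:
    """Hypothetical stand-in for the toolkit's GateResult."""
    name: str
    passed: bool
    message: str
    evidence: dict = field(default_factory=dict)


def render_failures(claims):
    """Return one log line per failed gate, with evidence as key=value pairs."""
    lines = []
    for claim_name, gate_results in claims.items():
        for gr in gate_results:
            if gr.passed:
                continue
            detail = ", ".join(f"{k}={v}" for k, v in sorted(gr.evidence.items()))
            lines.append(f"{claim_name} :: {gr.name} :: {gr.message} ({detail})")
    return lines


def exit_code(claims):
    """Return 1 if any gate in any claim failed, otherwise 0."""
    failed = any(not gr.passed for grs in claims.values() for gr in grs)
    return 1 if failed else 0


# Hand-written data shaped like the walkthrough; evidence keys are assumptions.
claims = {
    "model_a_releasable": [
        SketchGateResult(
            "metric_threshold:test:model_a:pr_auc", False, "metric threshold failed",
            {"value": 0.51, "threshold": 0.6, "metric_path": "pr_auc"},
        ),
        SketchGateResult("minimum_slice_size:test", True, "slice size sufficient",
                         {"n": 100, "min_n": 30}),
    ]
}
for line in render_failures(claims):
    print(line)
print("exit code:", exit_code(claims))
```

In a CI job the exit code would typically be passed to sys.exit so that a failed claim blocks the pipeline step.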
Pre-1.0 design note
The EvidenceGate Protocol is intentionally simple: a callable with the
signature (RunResult, ...) -> GateResult. Custom gates plug into
ClaimSpec.gates alongside the toolkit-provided ones. This matches the
NeurIPS Reproducibility Checklist pattern of structured evidence: each
claim is gated on specific, auditable conditions.
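A custom gate in this style might look like the sketch below, which uses typing.Protocol to describe the callable shape and a factory function to bind the gate's parameters. The names SketchGateResult, SketchEvidenceGate, and min_positive_count_gate are hypothetical, the run result is modelled as a plain nested dict whose layout is invented for the example, and the sketch keeps to a single argument even though the real signature may accept more.

```python
from dataclasses import dataclass, field
from typing import Any, Mapping, Protocol


@dataclass(frozen=True)
class SketchGateResult:
    """Hypothetical stand-in for the toolkit's GateResult."""
    name: str
    passed: bool
    message: str
    evidence: dict = field(default_factory=dict)


class SketchEvidenceGate(Protocol):
    """Assumed callable shape: a run result in, one gate result out."""

    def __call__(self, result: Mapping[str, Any]) -> SketchGateResult: ...


def min_positive_count_gate(slice_name: str, min_positives: int) -> SketchEvidenceGate:
    """Build a gate that requires at least `min_positives` positive labels."""

    def gate(result: Mapping[str, Any]) -> SketchGateResult:
        # The dict layout below is invented for this sketch.
        labels = result["slices"][slice_name]["labels"]
        n_pos = int(sum(labels))
        passed = n_pos >= min_positives
        return SketchGateResult(
            name=f"min_positive_count:{slice_name}",
            passed=passed,
            message="enough positives" if passed else "too few positives",
            evidence={"n_positive": n_pos, "min_positives": min_positives},
        )

    return gate


fake_result = {"slices": {"test": {"labels": [1, 0, 1, 1, 0]}}}
outcome = min_positive_count_gate("test", min_positives=5)(fake_result)
print(outcome.passed, outcome.evidence)
```

The factory pattern mirrors how the walkthrough configures gates such as minimum_slice_size_gate(slice_name="test", min_n=30): parameters are bound once, and the resulting callable is evaluated later against a run.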
For multi-comparison correction (Bonferroni / BH-FDR) when running many gates over many slices, see issue #1 (planned).
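For orientation, the two corrections named above can be sketched in a few lines of numpy. This assumes each gate could be reduced to a p-value, which the threshold and size gates in this example do not do; the function names and the p-values are hypothetical and say nothing about how the planned feature will be designed.

```python
import numpy as np


def bonferroni_reject(p_values, alpha=0.05):
    """Reject hypothesis i when p_i <= alpha / m (controls family-wise error)."""
    p = np.asarray(p_values, dtype=float)
    return p <= alpha / p.size


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Reject the k smallest p-values, k being the largest rank with p_(k) <= alpha * k / m."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = int(np.max(np.nonzero(below)[0]))  # last sorted position meeting the bound
        reject[order[: k + 1]] = True
    return reject


# Hypothetical p-values, one per gate/slice combination.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print("Bonferroni rejections:", int(bonferroni_reject(p_values).sum()))
print("BH-FDR rejections:", int(benjamini_hochberg_reject(p_values).sum()))
```

Bonferroni controls the chance of any false rejection and is the stricter of the two; Benjamini-Hochberg controls the expected share of false rejections and never rejects fewer hypotheses than Bonferroni at the same alpha.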
See also
- claims.py reference: full gate library (required_scorer_gate, required_slice_gate, paired_diff_present_gate, no_leakage_errors_gate, headline_present_gate, low_fpr_feasibility_gate, etc.).
- evidence.py reference: EvidenceAxis, AggregateEvidence for typed claim aggregation.
- Evaluate harness example: the upstream RunResult that gets fed into evaluate_claims.