Worked example: claims + evidence gates

What this shows. Compose EvidenceGates to drive a binary release decision (“ship this model” vs “block on insufficient evidence”). Each gate is a single check; evaluate_claims runs them against a RunResult and emits a structured ClaimReport.

Runtime: ~1 s. Pure-numpy core; no optional deps.

Setup

import numpy as np
import pandas as pd
from eval_toolkit import (
    EvalSlice, evaluate, set_global_seeds,
    ClaimSpec, evaluate_claims,
    metric_threshold_gate,
    minimum_slice_size_gate,
)
set_global_seeds(42)

Build a small RunResult to test against

(See evaluate_harness.md for the full pipeline; here we just need a RunResult to run claims on.)

rng = np.random.default_rng(42)
n = 100
y = np.concatenate([np.zeros(50), np.ones(50)]).astype(int)
rng.shuffle(y)
df = pd.DataFrame({"text": [f"r{i}" for i in range(n)], "label": y})
parent = EvalSlice(name="test", df=df)

class _Stub:
    def predict_proba(self, X):
        # Scores follow row-index parity ("r0", "r1", ...), not the shuffled
        # label, so this stub ranks close to chance on the slice.
        labels = np.array([int(x[1:]) % 2 for x in X])
        return np.clip(
            0.5 + 0.3 * (labels - 0.5) + rng.normal(0, 0.1, size=len(X)),
            0.0, 1.0,
        )

result = evaluate(
    scorers={"model_a": _Stub()},
    slices=[parent],
    run_id="claim_example",
    n_resamples=50,
    seed=42,
)

Compose a ClaimSpec from gates

A ClaimSpec is a named bundle of gates. All gates must pass for the claim to pass. The toolkit ships a library of reusable gates; you can also write custom ones (any callable matching the EvidenceGate Protocol). Two common ones:

claim = ClaimSpec(
    name="model_a_releasable",
    gates=(
        # Quality bar: PR-AUC on the test slice must be ≥ 0.60
        metric_threshold_gate(
            slice_name="test",
            scorer_name="model_a",
            metric_path="pr_auc",
            op=">=",
            threshold=0.60,
        ),
        # Statistical-power floor: don't claim anything on a tiny slice
        minimum_slice_size_gate(slice_name="test", min_n=30),
    ),
)
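
To make the op and threshold parameters concrete, here is a minimal sketch of how a threshold check could resolve a metric and apply a comparison. It is not the toolkit's implementation: the nested metrics dict, the resolve_metric and check_threshold helpers, the dotted-path convention, and the "ci" values are all assumptions made up for illustration.

```python
import operator

# Hypothetical stand-in for the metrics a RunResult might hold:
# slice name -> scorer name -> metric name -> value.
# The "ci" entry and its numbers are made up to show a nested path.
metrics = {"test": {"model_a": {"pr_auc": 0.51, "ci": {"low": 0.42, "high": 0.60}}}}

# Comparison strings mapped to standard-library operator functions.
OPS = {">=": operator.ge, ">": operator.gt, "<=": operator.le, "<": operator.lt}

def resolve_metric(tree, slice_name, scorer_name, metric_path):
    """Walk a dotted path such as "ci.low" through the nested dict."""
    node = tree[slice_name][scorer_name]
    for key in metric_path.split("."):
        node = node[key]
    return node

def check_threshold(tree, slice_name, scorer_name, metric_path, op, threshold):
    """Return (passed, value) for one threshold comparison."""
    value = resolve_metric(tree, slice_name, scorer_name, metric_path)
    return OPS[op](value, threshold), value

passed, value = check_threshold(metrics, "test", "model_a", "pr_auc", ">=", 0.60)
print(passed, value)
```

Keeping the comparison as a string plus a lookup table, rather than a lambda, is one way to keep a gate's configuration serializable and easy to echo back in its evidence.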

Run the claim

evaluate_claims(result, [claim]) runs every gate against the RunResult and produces a ClaimReport. Its claims attribute maps each claim name to a list of GateResult objects, one per gate:

report = evaluate_claims(result, [claim])
for gate_result in report.claims["model_a_releasable"]:
    status = "PASS" if gate_result.passed else "FAIL"
    print(f"  [{status}] {gate_result.name}: {gate_result.message}")
print(f"has_failures: {report.has_failures()}")

  [FAIL] metric_threshold:test:model_a:pr_auc: metric threshold failed
  [PASS] minimum_slice_size:test: slice size sufficient
has_failures: True

In this synthetic example the stub’s scores follow row-index parity rather than the shuffled label, so its pr_auc is around 0.51, close to the 0.5 chance level for a balanced slice. The threshold gate therefore fails. The minimum-size gate passes (n = 100 ≥ 30), but because every gate must pass, the claim as a whole fails.
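
The near-chance score can be reproduced without the toolkit. The sketch below assumes pr_auc behaves like step-wise average precision (the toolkit's exact definition may differ) and compares scores driven by row-index parity, which is what the stub keys on, with hypothetical scores driven by the true label. The average_precision helper and the seed are written for this illustration only.

```python
import numpy as np

def average_precision(y_true, scores):
    """Step-wise average precision: mean precision at the rank of each positive."""
    order = np.argsort(-scores, kind="stable")   # highest score first
    y_sorted = y_true[order]
    hits = np.cumsum(y_sorted)                   # positives seen up to each rank
    precision_at_k = hits / np.arange(1, len(y_sorted) + 1)
    return float(precision_at_k[y_sorted == 1].mean())

rng = np.random.default_rng(0)
n = 100
y = np.concatenate([np.zeros(50), np.ones(50)]).astype(int)
rng.shuffle(y)

parity = np.arange(n) % 2                        # index parity, unrelated to y
noise = rng.normal(0, 0.1, size=n)
scores_parity = np.clip(0.5 + 0.3 * (parity - 0.5) + noise, 0.0, 1.0)
scores_label = np.clip(0.5 + 0.3 * (y - 0.5) + noise, 0.0, 1.0)

ap_parity = average_precision(y, scores_parity)
ap_label = average_precision(y, scores_label)
print(f"parity-driven scores: AP = {ap_parity:.2f}")
print(f"label-driven scores:  AP = {ap_label:.2f}")
```

On a balanced slice, chance-level average precision sits near the positive rate of 0.5, so parity-driven scores should land well short of a 0.60 bar, while label-driven scores with the same noise should clear it comfortably.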

Release decision: every claim must pass

The pattern for shipping a release: every ClaimSpec must pass. report.has_failures() returns True if any gate failed:

if report.has_failures():
    print("BLOCK: at least one gate failed — see evidence above")
else:
    print("RELEASE: all gates passed")

BLOCK: at least one gate failed — see evidence above

GateResult.evidence carries structured detail (the actual value, the threshold, the metric path) so a downstream tool can render the failure breakdown in CI logs or a release dashboard.
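
As a rough sketch of such a downstream tool, the snippet below turns failed gates into one log line each and derives a CI exit code. The GateResult dataclass is a stand-in that mirrors only the fields used on this page, and the evidence keys (value, threshold, metric_path) are assumed from the description above; the real payload may be shaped differently.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class GateResult:
    """Stand-in mirroring the fields used in this example; not the toolkit class."""
    name: str
    passed: bool
    message: str
    evidence: dict = field(default_factory=dict)

def render_failures(claims):
    """Return one log line per failed gate, including its evidence payload."""
    lines = []
    for claim_name, gate_results in claims.items():
        for gate in gate_results:
            if gate.passed:
                continue
            detail = ", ".join(f"{k}={v!r}" for k, v in sorted(gate.evidence.items()))
            lines.append(f"{claim_name} :: {gate.name} :: {detail}")
    return lines

claims = {
    "model_a_releasable": [
        GateResult(
            name="metric_threshold:test:model_a:pr_auc",
            passed=False,
            message="metric threshold failed",
            # Hypothetical evidence keys, chosen to match the prose above.
            evidence={"value": 0.51, "threshold": 0.60, "metric_path": "pr_auc"},
        ),
        GateResult("minimum_slice_size:test", True, "slice size sufficient"),
    ]
}

lines = render_failures(claims)
for line in lines:
    print(line)
exit_code = 1 if lines else 0   # a CI wrapper could hand this to sys.exit
```

Returning lines instead of printing inside the helper keeps the rendering testable and lets the same function feed a log, a dashboard, or a pull-request comment.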

Pre-1.0 design note

The EvidenceGate Protocol is intentionally simple: a callable that takes (RunResult, ...) -> GateResult. Custom gates plug into ClaimSpec.gates alongside the toolkit-provided ones. This matches the NeurIPS Reproducibility Checklist pattern of structured evidence: each claim is gated on specific, auditable conditions.
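
A custom gate under that description could look like the sketch below. Everything here is a stand-in: the GateResult dataclass and the EvidenceGate Protocol are re-declared locally with an assumed single-argument signature, a plain dict plays the role of RunResult, and required_run_id_prefix_gate is a hypothetical gate invented for the example.

```python
from dataclasses import dataclass, field
from typing import Protocol

@dataclass(frozen=True)
class GateResult:
    """Stand-in for the toolkit's result type (fields taken from this example)."""
    name: str
    passed: bool
    message: str
    evidence: dict = field(default_factory=dict)

class EvidenceGate(Protocol):
    """Assumed shape: any callable from a run result to a GateResult."""
    def __call__(self, result) -> GateResult: ...

def required_run_id_prefix_gate(prefix: str) -> EvidenceGate:
    """Hypothetical gate factory: pass only if run_id starts with `prefix`."""
    def gate(result) -> GateResult:
        run_id = result["run_id"]            # plain dict stands in for RunResult
        ok = run_id.startswith(prefix)
        return GateResult(
            name=f"run_id_prefix:{prefix}",
            passed=ok,
            message="run_id prefix ok" if ok else "run_id prefix mismatch",
            evidence={"run_id": run_id, "prefix": prefix},
        )
    return gate

gate = required_run_id_prefix_gate("claim_")
outcome = gate({"run_id": "claim_example"})
print(outcome.passed, outcome.evidence)
```

Writing the gate as a factory that returns a closure mirrors how the built-in gates on this page are configured with keyword arguments and then handed to ClaimSpec.gates.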

For multi-comparison correction (Bonferroni / BH-FDR) when running many gates over many slices, see issue #1 (planned).
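
As background for that planned work, the two corrections can be sketched in plain numpy. This is a generic illustration of Bonferroni and Benjamini-Hochberg, not the toolkit's planned API; the function names and the p-values are made up for the example.

```python
import numpy as np

def bonferroni(p_values, alpha=0.05):
    """Reject where p <= alpha / m (controls the family-wise error rate)."""
    p = np.asarray(p_values, dtype=float)
    return p <= alpha / p.size

def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank with
    p_(k) <= alpha * k / m (the Benjamini-Hochberg step-up rule)."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                          # ranks from smallest p upward
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])           # largest rank meeting its threshold
        reject[order[: k + 1]] = True
    return reject

# Made-up p-values, e.g. one per (gate, slice) comparison.
p_values = [0.038, 0.001, 0.2, 0.012, 0.025]
print("bonferroni:", bonferroni(p_values).tolist())
print("bh-fdr:    ", benjamini_hochberg(p_values).tolist())
```

On these inputs Bonferroni keeps only the smallest p-value, while the step-up rule keeps four of the five, which illustrates why the choice of correction matters once many gates run over many slices.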

See also

  • claims.py reference — full gate library (required_scorer_gate, required_slice_gate, paired_diff_present_gate, no_leakage_errors_gate, headline_present_gate, low_fpr_feasibility_gate, etc.).

  • evidence.py reference: EvidenceAxis, AggregateEvidence for typed claim aggregation.

  • Evaluate harness example — the upstream RunResult that gets fed into evaluate_claims.