# Evidence And Claims

This chapter covers the generic evidence layer added for v1-prelude: source-role
metadata, validation-fit operating points applied across slices, and claim
gates. The goal is not to render a report. The goal is to make it hard for a
consumer project to publish a claim whose required evidence is missing.

(claim-mode)=
## Claim Mode vs Exploratory Mode

Exploratory runs may record missing evidence as ordinary context. Claim mode
should fail closed: if the declared headline comparison, source roles, sample
size, leakage checks, or diagnostic gates are missing, the consumer report
should say "no-go" or refuse to render claim language.

The toolkit supplies machine-readable gates only. Consumer projects own their
domain claim text and any report renderer.

(source-roles)=
## Source Roles

Use source roles when datasets have different evidentiary weight. The toolkit
does not enforce a taxonomy, but the recommended starting vocabulary is:

- `train`
- `validation`
- `development_eval`
- `external_diagnostic`
- `final_holdout_candidate`
- `locked_final_holdout`
- `excluded`

These roles are optional manifest metadata. They are useful when a claim gate
needs to ask whether a run actually included the evidence class it says it used.

```python
from eval_toolkit import SourceRoleRecord, build_manifest

manifest = build_manifest(
    run_id="demo",
    config={"seed": 42},
    source_roles=[
        SourceRoleRecord(source="main_train", role="train", n_rows=1000),
        SourceRoleRecord(source="hard_negatives", role="external_diagnostic", n_rows=200),
    ],
    required_source_roles=("train", "external_diagnostic"),
    guardrails=["do not tune thresholds on external diagnostics"],
)

assert manifest.source_roles[0]["role"] == "train"
```

(evidence-threshold-transfer)=
## Threshold Transfer

For operating-point evidence, fit the threshold on a mixed-class validation
slice and apply that exact threshold elsewhere.
This is the right primitive for OOD slices that are all-positive or
all-negative: the target slice cannot fit a threshold itself, but it can still
answer recall or false-positive-rate questions at a threshold chosen upstream.

```python
import numpy as np

from eval_toolkit import MaxF1Selector, apply_operating_points, fit_operating_points

y_val = np.array([0, 0, 1, 1])
s_val = np.array([0.1, 0.3, 0.7, 0.9])
y_ood = np.array([1, 1, 1])
s_ood = np.array([0.8, 0.4, 0.95])

fitted = fit_operating_points(
    y_val,
    s_val,
    [MaxF1Selector()],
    fitted_on_slice="validation",
    scorer_name="model",
)
applied = apply_operating_points(
    y_ood,
    s_ood,
    fitted,
    applied_to_slice="ood_positive",
    scorer_name="model",
)

assert applied["max_f1"]["slice_class"] == "all_positive"
assert "threshold_provenance" in applied["max_f1"]
```

When using the harness, pass `OperatingPointSpec` to `evaluate`. The result is
attached under each target scorer block as `transferred_operating_points`,
preserving where the threshold was fit and where it was applied.
```python
import numpy as np
import pandas as pd

from eval_toolkit import EvalSlice, MaxF1Selector, OperatingPointSpec, evaluate


class ScoreByText:
    def predict_proba(self, X):
        table = {"v0": 0.1, "v1": 0.2, "v2": 0.8, "v3": 0.9, "h0": 0.1, "h1": 0.9}
        return np.array([table[str(x)] for x in X])


validation = EvalSlice(
    name="validation",
    df=pd.DataFrame({"text": ["v0", "v1", "v2", "v3"], "label": [0, 0, 1, 1]}),
)
hard_negative = EvalSlice(
    name="hard_negative",
    df=pd.DataFrame({"text": ["h0", "h1"], "label": [0, 0]}),
)

run = evaluate(
    {"model": ScoreByText()},
    [validation, hard_negative],
    run_id="threshold-transfer",
    n_resamples=10,
    operating_point_specs=[
        OperatingPointSpec(
            name="validation_fit",
            fit_slice="validation",
            apply_slices=("hard_negative",),
            selectors=(MaxF1Selector(),),
        )
    ],
)

fpr = run.by_slice["hard_negative"]["by_scorer"]["model"][
    "transferred_operating_points"
]["validation_fit"]["max_f1"]["fpr@threshold"]
assert fpr == 0.5
```

(claim-gates)=
## Claim Gates

Claim gates are small checks over a result payload and optional manifest. They
are intentionally generic: the consumer names the claim, chooses the gates, and
renders any report text. Gates can be used for claim-bearing runs, while
exploratory runs can persist the same report as context without treating
failures as publish blockers.
```python
from eval_toolkit import (
    ClaimSpec,
    RunResult,
    evaluate_claims,
    low_fpr_feasibility_gate,
    metric_threshold_gate,
    minimum_slice_size_gate,
    no_scorer_errors_gate,
    source_role_gate,
    with_claim_report,
)

result = {
    "by_slice": {
        "test": {"n": 200, "n_positive": 80, "by_scorer": {}},
        "hard_negative": {
            "n": 200,
            "n_positive": 0,
            "by_scorer": {
                "model": {
                    "transferred_operating_points": {
                        "calib": {"max_f1": {"fpr@threshold": 0.01}}
                    }
                }
            },
        },
    }
}
manifest_payload = {
    "source_roles": [
        {"source": "main", "role": "train"},
        {"source": "diag", "role": "external_diagnostic"},
    ]
}
claim = ClaimSpec(
    name="example claim",
    gates=(
        minimum_slice_size_gate("test", min_n=100, min_positive=40, min_negative=40),
        source_role_gate(("train", "external_diagnostic")),
        metric_threshold_gate(
            "hard_negative",
            "model",
            "transferred_operating_points.calib.max_f1.fpr@threshold",
            op="<=",
            threshold=0.05,
        ),
        no_scorer_errors_gate(),
    ),
)

report = evaluate_claims(result, [claim], manifest=manifest_payload)
assert not report.has_failures()

run_result = RunResult(
    run_id="claim-demo", git_sha=None, config={}, by_slice=result["by_slice"]
)
stored = with_claim_report(run_result, report)
assert stored.claim_report["has_failures"] is False
```

(low-fpr-feasibility)=
## Low-FPR Feasibility

A tiny holdout cannot support a low-FPR claim even if it observes zero false
positives. `low_fpr_feasibility_gate` computes the best-case Wilson upper bound
for `0 / n_negative` and requires that upper bound to be no larger than the
requested FPR.
```python
from eval_toolkit import ClaimSpec, evaluate_claims, low_fpr_feasibility_gate

tiny_result = {"by_slice": {"holdout": {"n": 48, "n_positive": 23}}}
tiny_claim = ClaimSpec(
    name="low-FPR claim",
    gates=(low_fpr_feasibility_gate("holdout", max_fpr=0.05),),
)
tiny_report = evaluate_claims(tiny_result, [tiny_claim])
gate = tiny_report.claims["low-FPR claim"][0]

assert tiny_report.has_failures()
assert gate.evidence["n_negative"] == 25
assert gate.evidence["best_case_fpr_ci_high"] > 0.05
```

(evidence-pitfalls)=
## Pitfalls / Common mistakes

- **Fitting a threshold on the target OOD slice.** If that slice is part of
  claim evidence, this leaks target information into the operating point.
- **Treating single-class OOD recall as a full classifier claim.** All-positive
  slices cannot estimate precision, FPR, or calibration.
- **Letting diagnostics stay advisory in claim mode.** If a hard-negative
  diagnostic is required for a claim, encode its FPR cap as a gate.
- **Claiming low FPR from too few negatives.** A zero-FP result on a small
  negative slice can still have a large upper confidence bound.
- **Putting domain policy into the toolkit.** Domain labels, claim text, and
  deployment policy belong in the consumer project.
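The "too few negatives" pitfall can be checked by hand before a holdout is collected. The sketch below is standalone and does not use `eval_toolkit`: `wilson_upper` and `min_negatives_for_fpr` are hypothetical helper names, and the two-sided 95% confidence level is an assumption, since this chapter does not state which level `low_fpr_feasibility_gate` uses. It computes the Wilson upper bound for an observed failure count and the smallest negative count whose zero-FP bound fits under a requested FPR cap.

```python
import math
from statistics import NormalDist


def wilson_upper(n_fail: int, n: int, confidence: float = 0.95) -> float:
    """Upper end of the two-sided Wilson score interval for n_fail / n.

    Assumption: a two-sided interval at `confidence`; the toolkit's gate may
    be configured differently.
    """
    if n <= 0:
        return 1.0  # no negatives observed: nothing can be bounded
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    p_hat = n_fail / n
    centre = p_hat + z * z / (2 * n)
    spread = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return (centre + spread) / (1 + z * z / n)


def min_negatives_for_fpr(max_fpr: float, confidence: float = 0.95) -> int:
    """Smallest n_negative whose zero-FP Wilson upper bound is <= max_fpr."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    # With zero observed false positives the bound reduces to z^2 / (n + z^2),
    # so solving z^2 / (n + z^2) <= max_fpr for n gives the expression below.
    return math.ceil(z * z * (1 - max_fpr) / max_fpr)


# 25 negatives, as in the tiny holdout example (48 rows, 23 positives).
best_case_25 = wilson_upper(0, 25)
needed = min_negatives_for_fpr(0.05)
```

Under these assumptions, 25 negatives give a best-case upper bound of about 0.13, well above a 0.05 cap, and roughly 73 negatives would be needed before a zero-FP result could support that cap. A different confidence level or interval method would shift both numbers.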