Evidence And Claims
This chapter covers the generic evidence layer added for v1-prelude: source-role metadata, validation-fit operating points applied across slices, and claim gates. The goal is not to render a report. The goal is to make it hard for a consumer project to publish a claim whose required evidence is missing.
Claim Mode vs Exploratory Mode
Exploratory runs may record missing evidence as ordinary context. Claim mode should fail closed: if the declared headline comparison, source roles, sample size, leakage checks, or diagnostic gates are missing, the consumer report should say “no-go” or refuse to render claim language.
The toolkit supplies machine-readable gates only. Consumer projects own their domain claim text and any report renderer.
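One way a consumer project might implement the fail-closed rule is sketched below. Everything in it is hypothetical consumer-side code: `GateOutcome`, `headline_for`, and the `mode` strings are illustrative names, not toolkit API. The only point is that in claim mode a failed gate, or no gates at all, yields "no-go" instead of claim text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateOutcome:
    """Hypothetical consumer-side record of one gate check (not toolkit API)."""

    name: str
    passed: bool


def headline_for(outcomes, mode, claim_text):
    """Return claim text only when claim mode has evidence and every gate passed."""
    failed = [o.name for o in outcomes if not o.passed]
    if mode == "exploratory":
        # Exploratory runs keep failures as context and never emit claim language.
        return f"exploratory run; unmet gates: {failed}"
    if mode != "claim":
        raise ValueError(f"unknown mode: {mode!r}")
    # Fail closed: having no recorded gates counts as missing evidence, not success.
    if not outcomes or failed:
        return "no-go"
    return claim_text


outcomes = [GateOutcome("min_slice_size", True), GateOutcome("source_roles", False)]
print(headline_for(outcomes, "claim", "Model meets the FPR target."))  # no-go
```

Treating an empty gate list as a failure is a design choice of this sketch: it prevents a misconfigured claim from passing by default.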
Source Roles
Use source roles when datasets have different evidentiary weight. The toolkit does not enforce a taxonomy, but the recommended starting vocabulary is:
train
validation
development_eval
external_diagnostic
final_holdout_candidate
locked_final_holdout
excluded
These roles are optional manifest metadata. They are useful when a claim gate needs to ask whether a run actually included the evidence class it says it used.
from eval_toolkit import SourceRoleRecord, build_manifest

manifest = build_manifest(
    run_id="demo",
    config={"seed": 42},
    source_roles=[
        SourceRoleRecord(source="main_train", role="train", n_rows=1000),
        SourceRoleRecord(source="hard_negatives", role="external_diagnostic", n_rows=200),
    ],
    required_source_roles=("train", "external_diagnostic"),
    guardrails=["do not tune thresholds on external diagnostics"],
)

assert manifest.source_roles[0]["role"] == "train"
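The question a claim gate asks of these roles can be illustrated without the toolkit. The sketch below is an assumption about the shape of such a check, not the toolkit's implementation; `missing_roles` is a hypothetical helper, and it assumes role records are plain dicts with a `role` key, as the assertion above suggests.

```python
def missing_roles(source_roles, required_roles):
    """Return required roles that no recorded source carries, in declared order."""
    present = {record["role"] for record in source_roles}
    return [role for role in required_roles if role not in present]


# Same sources and roles as the manifest example above.
recorded = [
    {"source": "main_train", "role": "train", "n_rows": 1000},
    {"source": "hard_negatives", "role": "external_diagnostic", "n_rows": 200},
]

# Both required roles are present, so nothing is missing.
assert missing_roles(recorded, ("train", "external_diagnostic")) == []
# A claim that needs a locked holdout would be blocked by this run.
assert missing_roles(recorded, ("train", "locked_final_holdout")) == ["locked_final_holdout"]
```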
Threshold Transfer
For operating-point evidence, fit the threshold on a mixed-class validation slice and apply that exact threshold elsewhere. This is the right primitive for OOD slices that are all-positive or all-negative: the target slice cannot fit a threshold itself, but it can still answer recall or false-positive-rate questions at a threshold chosen upstream.
import numpy as np
from eval_toolkit import MaxF1Selector, apply_operating_points, fit_operating_points

# Mixed-class validation slice: the only place a threshold is fit.
y_val = np.array([0, 0, 1, 1])
s_val = np.array([0.1, 0.3, 0.7, 0.9])

# All-positive OOD slice: it cannot fit a threshold of its own.
y_ood = np.array([1, 1, 1])
s_ood = np.array([0.8, 0.4, 0.95])

fitted = fit_operating_points(
    y_val,
    s_val,
    [MaxF1Selector()],
    fitted_on_slice="validation",
    scorer_name="model",
)

# Apply the validation-fit operating points unchanged to the OOD slice.
applied = apply_operating_points(
    y_ood,
    s_ood,
    fitted,
    applied_to_slice="ood_positive",
    scorer_name="model",
)

assert applied["max_f1"]["slice_class"] == "all_positive"
assert "threshold_provenance" in applied["max_f1"]
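The mechanics of threshold transfer can be reproduced by hand with NumPy. This is a minimal sketch under stated assumptions, not the toolkit's `MaxF1Selector`: it assumes candidate thresholds are the observed scores, that a row is predicted positive when `score >= threshold`, and that ties keep the lowest threshold. `fit_max_f1_threshold` and `recall_at` are hypothetical helper names.

```python
import numpy as np


def fit_max_f1_threshold(y_true, scores):
    """Pick the candidate threshold (an observed score) with the highest F1.

    Predictions are positive when score >= threshold; ties keep the lowest threshold.
    """
    best_threshold, best_f1 = None, -1.0
    for threshold in np.unique(scores):  # unique scores, sorted ascending
        predicted = scores >= threshold
        tp = int(np.sum(predicted & (y_true == 1)))
        fp = int(np.sum(predicted & (y_true == 0)))
        fn = int(np.sum(~predicted & (y_true == 1)))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_threshold, best_f1 = float(threshold), f1
    return best_threshold


def recall_at(y_true, scores, threshold):
    """Recall of the positive class at a fixed, externally chosen threshold."""
    positives = y_true == 1
    return float(np.mean(scores[positives] >= threshold))


y_val = np.array([0, 0, 1, 1])
s_val = np.array([0.1, 0.3, 0.7, 0.9])
y_ood = np.array([1, 1, 1])  # all-positive slice: no threshold can be fit here
s_ood = np.array([0.8, 0.4, 0.95])

threshold = fit_max_f1_threshold(y_val, s_val)   # fit once, on validation only
ood_recall = recall_at(y_ood, s_ood, threshold)  # reuse unchanged on the OOD slice
print(threshold, round(ood_recall, 3))  # 0.7 0.667
```

The OOD slice contributes nothing to the choice of threshold; it only reports recall at a value fixed upstream, which is what keeps the evidence free of target leakage.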
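When using the harness, pass one or more OperatingPointSpec objects to evaluate through the operating_point_specs argument. Each result is attached under the target scorer block as transferred_operating_points, keyed by spec name and then selector name, preserving where the threshold was fit and where it was applied.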
import numpy as np
import pandas as pd
from eval_toolkit import EvalSlice, MaxF1Selector, OperatingPointSpec, evaluate


class ScoreByText:
    """Toy scorer that returns a fixed score for each known text value."""

    def predict_proba(self, X):
        table = {"v0": 0.1, "v1": 0.2, "v2": 0.8, "v3": 0.9, "h0": 0.1, "h1": 0.9}
        return np.array([table[str(x)] for x in X])


# Mixed-class slice used to fit the threshold.
validation = EvalSlice(
    name="validation",
    df=pd.DataFrame({"text": ["v0", "v1", "v2", "v3"], "label": [0, 0, 1, 1]}),
)
# All-negative slice that only receives the transferred threshold.
hard_negative = EvalSlice(
    name="hard_negative",
    df=pd.DataFrame({"text": ["h0", "h1"], "label": [0, 0]}),
)

run = evaluate(
    {"model": ScoreByText()},
    [validation, hard_negative],
    run_id="threshold-transfer",
    n_resamples=10,
    operating_point_specs=[
        OperatingPointSpec(
            name="validation_fit",
            fit_slice="validation",
            apply_slices=("hard_negative",),
            selectors=(MaxF1Selector(),),
        )
    ],
)

# One of the two hard negatives (h1, score 0.9) clears the transferred threshold.
fpr = run.by_slice["hard_negative"]["by_scorer"]["model"][
    "transferred_operating_points"
]["validation_fit"]["max_f1"]["fpr@threshold"]
assert fpr == 0.5
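The expected value of 0.5 does not depend on exactly where the selector places the threshold inside the validation gap. The check below is a hand calculation, not toolkit code; it assumes a row counts as a false positive when its score is at or above the threshold, and `fpr_at` is a hypothetical helper.

```python
import numpy as np

hard_negative_scores = np.array([0.1, 0.9])  # scores for h0 and h1 in the lookup table


def fpr_at(scores, threshold):
    """FPR on an all-negative slice: the share of rows scored at or above the threshold."""
    return float(np.mean(scores >= threshold))


# Validation is separated perfectly by any threshold above 0.2 and at most 0.8.
for threshold in (0.21, 0.5, 0.8):
    assert fpr_at(hard_negative_scores, threshold) == 0.5
```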
Claim Gates
Claim gates are small checks over a result payload and optional manifest. They are intentionally generic: the consumer names the claim, chooses the gates, and renders any report text. Gates can be used for claim-bearing runs, while exploratory runs can persist the same report as context without treating failures as publish blockers.
from eval_toolkit import (
    ClaimSpec,
    RunResult,
    evaluate_claims,
    low_fpr_feasibility_gate,
    metric_threshold_gate,
    minimum_slice_size_gate,
    no_scorer_errors_gate,
    source_role_gate,
    with_claim_report,
)

# Minimal result payload: a mixed-class test slice and an all-negative diagnostic slice.
result = {
    "by_slice": {
        "test": {"n": 200, "n_positive": 80, "by_scorer": {}},
        "hard_negative": {
            "n": 200,
            "n_positive": 0,
            "by_scorer": {
                "model": {
                    "transferred_operating_points": {
                        "calib": {"max_f1": {"fpr@threshold": 0.01}}
                    }
                }
            },
        },
    }
}
manifest_payload = {
    "source_roles": [
        {"source": "main", "role": "train"},
        {"source": "diag", "role": "external_diagnostic"},
    ]
}

claim = ClaimSpec(
    name="example claim",
    gates=(
        minimum_slice_size_gate("test", min_n=100, min_positive=40, min_negative=40),
        source_role_gate(("train", "external_diagnostic")),
        # Cap the hard-negative FPR at the transferred operating point.
        metric_threshold_gate(
            "hard_negative",
            "model",
            "transferred_operating_points.calib.max_f1.fpr@threshold",
            op="<=",
            threshold=0.05,
        ),
        no_scorer_errors_gate(),
    ),
)

report = evaluate_claims(result, [claim], manifest=manifest_payload)
assert not report.has_failures()

# Persist the claim report alongside the run result.
run_result = RunResult(
    run_id="claim-demo",
    git_sha=None,
    config={},
    by_slice=result["by_slice"],
)
stored = with_claim_report(run_result, report)
assert stored.claim_report["has_failures"] is False
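The dotted metric path used by the threshold gate above suggests a simple lookup over the nested payload. The sketch below shows one plausible way such a check could work; it is an assumption for illustration, not the toolkit's implementation, and `resolve_path`, `check_metric`, and `_OPS` are hypothetical names.

```python
import operator

# Comparison operators a threshold check might accept (illustrative set).
_OPS = {"<=": operator.le, "<": operator.lt, ">=": operator.ge, ">": operator.gt}


def resolve_path(block, dotted_path):
    """Walk a nested dict by dot-separated keys; return None if any key is absent."""
    node = block
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def check_metric(result, slice_name, scorer, dotted_path, op, threshold):
    """Pass only if the metric exists and satisfies the comparison."""
    scorer_block = (
        result.get("by_slice", {}).get(slice_name, {}).get("by_scorer", {}).get(scorer)
    )
    value = resolve_path(scorer_block, dotted_path)
    if value is None:
        return False  # missing evidence is a failure, not a pass
    return _OPS[op](value, threshold)


result = {
    "by_slice": {
        "hard_negative": {
            "by_scorer": {
                "model": {
                    "transferred_operating_points": {
                        "calib": {"max_f1": {"fpr@threshold": 0.01}}
                    }
                }
            }
        }
    }
}
path = "transferred_operating_points.calib.max_f1.fpr@threshold"
assert check_metric(result, "hard_negative", "model", path, "<=", 0.05)
assert not check_metric(result, "hard_negative", "model", "missing.metric", "<=", 0.05)
```

Returning a failure for an absent metric mirrors the fail-closed behavior this chapter recommends for claim mode.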
Low-FPR Feasibility
A tiny holdout cannot support a low-FPR claim even if it observes zero
false positives. low_fpr_feasibility_gate computes the best-case
Wilson upper bound for 0 / n_negative and requires that upper bound to
be no larger than the requested FPR.
from eval_toolkit import ClaimSpec, evaluate_claims, low_fpr_feasibility_gate

# 48 rows with 23 positives leaves only 25 negatives.
tiny_result = {"by_slice": {"holdout": {"n": 48, "n_positive": 23}}}
tiny_claim = ClaimSpec(
    name="low-FPR claim",
    gates=(low_fpr_feasibility_gate("holdout", max_fpr=0.05),),
)

tiny_report = evaluate_claims(tiny_result, [tiny_claim])
gate = tiny_report.claims["low-FPR claim"][0]

assert tiny_report.has_failures()
assert gate.evidence["n_negative"] == 25
assert gate.evidence["best_case_fpr_ci_high"] > 0.05
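The arithmetic behind this failure can be checked by hand. With zero observed events out of n trials, the Wilson upper bound simplifies to z² / (n + z²). The sketch below assumes a two-sided 95% interval (z ≈ 1.96); the confidence level the toolkit actually uses is not stated in this chapter, so treat the numbers as illustrative. `wilson_upper_zero_events` and `min_negatives_for` are hypothetical helpers.

```python
import math


def wilson_upper_zero_events(n, z=1.959964):
    """Wilson upper bound when 0 of n trials are events: z^2 / (n + z^2)."""
    if n <= 0:
        return 1.0  # with no negatives, no FPR can be ruled out
    return z * z / (n + z * z)


def min_negatives_for(max_fpr, z=1.959964):
    """Smallest n whose zero-event Wilson upper bound is at most max_fpr."""
    return math.ceil(z * z * (1.0 - max_fpr) / max_fpr)


print(round(wilson_upper_zero_events(25), 3))  # 0.133
print(min_negatives_for(0.05))                 # 73
```

Under that assumption, the 25 negatives in the example leave a best-case upper bound near 0.133, well above 0.05, and a 0.05 claim would need at least 73 negatives even with no false positives observed.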
Pitfalls / Common mistakes
Fitting a threshold on the target OOD slice. If that slice is part of claim evidence, this leaks target information into the operating point.
Treating single-class OOD recall as a full classifier claim. All-positive slices cannot estimate precision, FPR, or calibration.
Letting diagnostics stay advisory in claim mode. If a hard-negative diagnostic is required for a claim, encode its FPR cap as a gate.
Claiming low FPR from too few negatives. A zero-FP result on a small negative slice can still have a large upper confidence bound.
Putting domain policy into the toolkit. Domain labels, claim text, and deployment policy belong in the consumer project.