# v0.8.x → v0.9.0 migration

v0.9.0 is **an additive feature release** introducing the *evidence core*: six new public modules (`claims`, `artifacts`, `evidence`, `operating_points`, `analysis`, `protocols`) and six new optional `RunResult` fields. No v0.8 public API was removed, renamed, or behavior-changed. Existing v0.8 harness code keeps working unchanged; v0.9 features are opt-in.

This guide covers what's new, when to adopt each piece, and how to extend a v0.8 harness with v0.9 evidence gates.

## At a glance

| Change | Type | Action |
|---|---|---|
| New `eval_toolkit.claims` module — `ClaimSpec`, `EvidenceGate`, `evaluate_claims` | Added | Optional: adopt to encode pass/fail evidence per claim. |
| New `eval_toolkit.artifacts` module — `PredictionArtifactRef`, `validate_payload` | Added | Optional: serialize prediction artifacts for downstream consumers. |
| New `eval_toolkit.evidence` module — `EvidenceAxis`, `PairingMetadata`, `AggregateEvidence` | Added | Optional: declare how aggregated evidence was produced. |
| New `eval_toolkit.operating_points` module — `OperatingPointSpec`, `fit_operating_points`, `apply_operating_points` | Added | Optional: transfer a fitted operating point across slices. |
| New `eval_toolkit.analysis` module — `bootstrap_metric_from_predictions`, `paired_diff_from_prediction_refs` | Added | Optional: bootstrap / paired-diff from saved predictions. |
| New `eval_toolkit.protocols` module — `Scorer`, `SliceAwareScorer`, `EvalSliceLike`, `PredictionReader`, `Versioned` | Added | Pandas-free Protocol home. Re-exported from `eval_toolkit` and `eval_toolkit.harness`. |
| `RunResult` gained 6 optional fields | Added | Optional: populate to publish v0.9 evidence; defaults preserve v0.8 shape. |
| `build_manifest` gained `source_roles`, `required_source_roles`, `prediction_artifacts` kwargs | Added | Optional: declare data lineage and artifact provenance. |
| New manifest schema `manifest.v1.json` | Added | Optional: validate manifests with `validate_payload`. |
| `results.v1.json` / `results_full.v1.json` gained optional top-level fields | Changed (additive) | None — consumers tolerate via `additionalProperties: true`. |
| New `[validation]` extra ships `jsonschema>=4.21` | Added | Install if you use `validate_payload`. |

## 1. New module map

### `eval_toolkit.claims`

Generic evidence gates. Encode the conditions a claim must satisfy before you treat the metric report as supporting it. Ships ~12 gate constructors covering required-slice / required-metric / source-role / low-FPR-feasibility / strict-artifact / external-diagnostic / etc.

Public surface: `ClaimSpec`, `ClaimReport`, `EvidenceGate`, `GateResult`, `evaluate_claims`, and the gate constructors.

### `eval_toolkit.artifacts`

Prediction-artifact references and payload validation. A `PredictionArtifactRef` declares: where the per-row prediction file lives (URI), what its bytes hash to (`sha256`), and how its columns map to the canonical `(row_id, label, score, scorer, slice, …)` contract.

`validate_payload(payload, schema_filename)` runs the schema in `eval_toolkit/schemas/` against the payload (requires `pip install "eval-toolkit[validation]"`).

### `eval_toolkit.evidence`

Three small dataclasses for declaring aggregated-evidence provenance: `EvidenceAxis` (one dimension of aggregation — slice, scorer, fold, …), `PairingMetadata` (was the bootstrap paired? what was the unit?), `AggregateEvidence` (the rolled-up status + method + axes).

### `eval_toolkit.operating_points`

`OperatingPointSpec` declares a fit target (e.g. "min recall ≥ 0.7 subject to FPR ≤ 0.05"); `fit_operating_points` produces a `FittedOperatingPoint` (threshold + diagnostics); `apply_operating_points` applies the fitted threshold to a different slice's scores. Enables principled cross-slice threshold transfer.
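
To make the idea of cross-slice threshold transfer concrete, here is a minimal pure-Python sketch of the fit-then-apply pattern. It does not use `eval_toolkit.operating_points`: the helper names `fit_threshold_at_fpr` and `apply_threshold`, the toy scores, and the FPR budget of 0.25 are all hypothetical, chosen only for illustration, and the real module's signatures may differ.

```python
def fit_threshold_at_fpr(scores, labels, max_fpr):
    """Hypothetical helper: lowest observed threshold whose FPR is <= max_fpr.

    Picking the lowest feasible threshold maximizes recall under the FPR budget.
    """
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not negatives:
        raise ValueError("need at least one negative row to measure FPR")
    # FPR is non-increasing as the threshold rises, so scan upward and
    # stop at the first threshold that fits the budget.
    for t in sorted(set(scores)):
        fpr = sum(s >= t for s in negatives) / len(negatives)
        if fpr <= max_fpr:
            return t
    return float("inf")  # no observed threshold fits: predict nothing positive


def apply_threshold(scores, labels, threshold):
    """Hypothetical helper: recall and FPR of a fixed threshold on a slice."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    recall = sum(s >= threshold for s in positives) / len(positives)
    fpr = sum(s >= threshold for s in negatives) / len(negatives)
    return {"threshold": threshold, "recall": recall, "fpr": fpr}


# Fit on one slice (toy data, illustration only).
fit_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
fit_labels = [0, 0, 0, 1, 0, 1, 1, 1]
threshold = fit_threshold_at_fpr(fit_scores, fit_labels, max_fpr=0.25)
fit_report = apply_threshold(fit_scores, fit_labels, threshold)

# Apply the frozen threshold to a different slice without refitting.
target_scores = [0.05, 0.35, 0.45, 0.5, 0.3, 0.85]
target_labels = [0, 0, 0, 1, 1, 1]
target_report = apply_threshold(target_scores, target_labels, threshold)

print(fit_report)
print(target_report)
```

In this toy data the transferred threshold exceeds the FPR budget on the second slice, which is the kind of diagnostic a fitted operating point is meant to surface rather than hide.
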

### `eval_toolkit.analysis`

Filesystem-aware helpers that bridge prediction artifacts and the bootstrap kernel: `bootstrap_metric_from_predictions` and `paired_diff_from_prediction_refs`. Use when you've stored predictions to disk and want to re-derive CIs without re-running scorers.

### `eval_toolkit.protocols`

Lightweight Protocols with **zero runtime pandas dependency** — the home for `Scorer`, `SliceAwareScorer`, `EvalSliceLike`, `PredictionReader`, `Versioned`. Pandas types appear in annotations only via `TYPE_CHECKING`. Re-exported from both `eval_toolkit` (top level) and `eval_toolkit.harness` for backward compatibility — consumer imports don't change.

## 2. `RunResult` field additions

Six new fields, all optional, all default to empty containers:

| Field | Type | What it carries |
|---|---|---|
| `claim_report` | `dict[str, object]` | Serialized `ClaimReport.to_dict()`. |
| `prediction_artifacts` | `list[dict[str, object]]` | List of `PredictionArtifactRef.to_dict()`. |
| `evidence_axes` | `list[dict[str, object]]` | List of `EvidenceAxis.to_dict()`. |
| `pairing_metadata` | `dict[str, object]` | `PairingMetadata.to_dict()`. |
| `aggregate_evidence` | `dict[str, object]` | `AggregateEvidence.to_dict()`. |
| `threshold_policy` | `dict[str, object]` | `ThresholdPolicyMetadata.to_dict()`. |

v0.8 construction still works — no kwargs needed:

```python
from eval_toolkit.harness import RunResult

result = RunResult(run_id="r", git_sha=None, config={}, by_slice={})
assert result.claim_report == {}
assert result.prediction_artifacts == []
```

v0.9 enriched construction — pass any subset:

```python
from eval_toolkit.evidence import EvidenceAxis
from eval_toolkit.harness import RunResult

result = RunResult(
    run_id="r",
    git_sha=None,
    config={},
    by_slice={},
    evidence_axes=[EvidenceAxis(name="slice", value="locked_eval").to_dict()],
)
assert result.evidence_axes[0]["name"] == "slice"
```

## 3. End-to-end walkthrough — extending a v0.8 harness with v0.9 evidence

This section shows a minimal v0.8 harness and how to extend it with v0.9 claim gates without rewriting the eval loop.

### v0.8: bare metric report

A v0.8 harness scores one model on one slice and writes a result:

```python
import numpy as np
import pandas as pd

from eval_toolkit import EvalSlice, evaluate


class TinyScorer:
    """Toy scorer that returns calibrated noise."""

    def predict_proba(self, X):
        rng = np.random.default_rng(0)
        return rng.uniform(0.2, 0.8, size=len(X))


df = pd.DataFrame({"text": [f"r{i}" for i in range(40)], "label": [0, 1] * 20})
slices = [EvalSlice(name="locked_eval", df=df)]
result = evaluate({"tiny": TinyScorer()}, slices, run_id="v08-demo")
assert "locked_eval" in result.by_slice
```

That's a complete v0.8 run. `result.by_slice["locked_eval"]` carries the metrics. No claim evidence is attached — consumers eyeball the numbers.

### v0.9: same harness + claim gates

The v0.9 upgrade keeps the eval loop unchanged. After `evaluate`, build a `ClaimSpec` listing the evidence the claim requires, run `evaluate_claims`, and attach the report:

```python
import numpy as np
import pandas as pd

from eval_toolkit import EvalSlice, evaluate
from eval_toolkit.claims import (
    ClaimSpec,
    evaluate_claims,
    required_slice_gate,
    required_scorer_gate,
    minimum_slice_size_gate,
)
from eval_toolkit.harness import with_claim_report


class TinyScorer:
    def predict_proba(self, X):
        rng = np.random.default_rng(0)
        return rng.uniform(0.2, 0.8, size=len(X))


df = pd.DataFrame({"text": [f"r{i}" for i in range(40)], "label": [0, 1] * 20})
slices = [EvalSlice(name="locked_eval", df=df)]
result = evaluate({"tiny": TinyScorer()}, slices, run_id="v09-demo")

# v0.9 addition: declare the evidence the "tiny works on locked_eval"
# claim requires, then evaluate it against the result payload.
claim = ClaimSpec(
    name="tiny works on locked_eval",
    gates=(
        required_slice_gate("locked_eval"),
        required_scorer_gate("locked_eval", "tiny"),
        minimum_slice_size_gate("locked_eval", min_n=20, min_positive=10),
    ),
)
report = evaluate_claims(result.to_dict(), [claim])

# Frozen-by-value attachment — RunResult itself is immutable.
enriched = with_claim_report(result, report)
assert not report.has_failures()
assert enriched.claim_report["has_failures"] is False
assert len(enriched.claim_report["claims"]["tiny works on locked_eval"]) == 3
```

The harness loop didn't change. The only new code is the `ClaimSpec` declaration, the `evaluate_claims` call, and the `with_claim_report` attachment.

### Failure path: what a missing gate looks like

If the claim requires a slice that wasn't produced, the report records the failing gate without aborting:

```python
from eval_toolkit.claims import ClaimSpec, evaluate_claims, required_slice_gate

claim = ClaimSpec(name="ood claim", gates=(required_slice_gate("ood_unseen"),))
empty_result = {"by_slice": {}}
report = evaluate_claims(empty_result, [claim])

assert report.has_failures()
gate_result = report.claims["ood claim"][0]
assert gate_result.passed is False
assert "missing slice" in gate_result.message
```

This pattern lets your CI gate publication on `report.has_failures()` without sprinkling assertions throughout the eval code.

## 4. `Scorer` Protocol consolidation (v0.9.0 + v0.9.1)

v0.9.0 introduced `eval_toolkit.protocols` as the pandas-free home for `Scorer` and `SliceAwareScorer` but **left duplicate copies in `eval_toolkit.harness`**. v0.9.1 removes the duplicates; `harness.py` now imports from `protocols.py`.
The top-level re-export `from eval_toolkit import Scorer` resolves to the same class as before:

```python
from eval_toolkit import Scorer
from eval_toolkit.harness import Scorer as HarnessScorer
from eval_toolkit.protocols import Scorer as ProtocolsScorer

assert Scorer is HarnessScorer is ProtocolsScorer
assert Scorer.__module__ == "eval_toolkit.protocols"
```

The canonical `Scorer.predict_proba` signature still accepts `list[str] | np.ndarray | pd.Series`. Pandas is imported under `TYPE_CHECKING` only in `protocols.py`, so `import eval_toolkit` does not pay a pandas import cost when pandas is uninstalled.

## 5. Schema additions

`results.v1.json` and `results_full.v1.json` gained these top-level optional fields, mirroring the `RunResult` additions in section 2: `claim_report`, `prediction_artifacts`, `evidence_axes`, `pairing_metadata`, `aggregate_evidence`, `threshold_policy`.

The schemas keep the `.v1` filename because every change is additive and the schemas declare `additionalProperties: true` — v0.8 consumers read v0.9 outputs without error.

A new schema `manifest.v1.json` lands for validating `RunManifest.to_dict()` payloads.

See [`methodology/versioning.md` § Schema evolution policy](../methodology/versioning.md#schema-evolution) for the policy that governs when a filename gets bumped to `.v2`.

## 6. The `validation` extra

`jsonschema>=4.21` moved into a focused `[validation]` extra in v0.9. Install only if you use `eval_toolkit.artifacts.validate_payload`:

```bash
pip install "eval-toolkit[validation]"
```

`validate_payload` lazy-imports jsonschema; if it's missing, it raises `ImportError` with a clear pointer to the extra. The `[all]` and `[dev]` extras include it transitively, so most users won't need to install it explicitly.

## 7. Pitfalls / common mistakes

- **Forgetting `with_claim_report`**: `RunResult` is frozen. Mutating `result.claim_report` directly is impossible; you must use `with_claim_report(result, report)` to get a new `RunResult` with the evidence attached.
- **Assuming `claim_report` is auto-populated**: evaluation does not auto-evaluate claims. You must explicitly call `evaluate_claims(...)` and attach the result. The opt-in is deliberate — claims are caller-defined and the harness shouldn't guess them.
- **Confusing `evaluate_claims` argument order**: the signature is `evaluate_claims(result, claim_specs, *, manifest=None)` — result first, specs second. Not `(specs, result)`.
- **Treating `additionalProperties: true` as "unstable"**: it's the forward-compat *contract*. v0.8 consumers tolerate v0.9's new fields; v0.9 consumers tolerate hypothetical future additions under the same `.v1.json` filename. Filename bumps signal real breakage; field additions don't.
- **Using `jsonschema` without the extra**: `validate_payload` raises `ImportError` with installation guidance. The lazy import is intentional — core scientific use cases don't need it.

## See also

- [`CHANGELOG.md`](https://github.com/brandon-behring/eval-toolkit/blob/main/CHANGELOG.md) `[0.9.0]` and `[0.9.1]` blocks for the per-version release notes.
- [`methodology/evidence.md`](../methodology/evidence.md) for the source-role / claim-gate methodology these primitives encode.
- [`methodology/versioning.md`](../methodology/versioning.md) § Schema evolution policy — when schema filenames bump vs. stay.
- [`migration/v0.8.md`](v0.8.md) for the v0.7 → v0.8 step (read first if you're upgrading from v0.7.x).
- [`migration/v0.7.md`](v0.7.md) for the v0.6 → v0.7 step.