v0.8.x → v0.9.0 migration

v0.9.0 is an additive feature release that introduces the evidence core: six new public modules (claims, artifacts, evidence, operating_points, analysis, protocols) and six new optional RunResult fields. No v0.8 public API was removed or renamed, and none changed behavior. Existing v0.8 harness code keeps working unchanged; v0.9 features are opt-in.

This guide covers what’s new, when to adopt each piece, and how to extend a v0.8 harness with v0.9 evidence gates.

At a glance

| Change | Type | Action |
| --- | --- | --- |
| New eval_toolkit.claims module: ClaimSpec, EvidenceGate, evaluate_claims | Added | Optional: adopt to encode pass/fail evidence per claim. |
| New eval_toolkit.artifacts module: PredictionArtifactRef, validate_payload | Added | Optional: serialize prediction artifacts for downstream consumers. |
| New eval_toolkit.evidence module: EvidenceAxis, PairingMetadata, AggregateEvidence | Added | Optional: declare how aggregated evidence was produced. |
| New eval_toolkit.operating_points module: OperatingPointSpec, fit_operating_points, apply_operating_points | Added | Optional: transfer a fitted operating point across slices. |
| New eval_toolkit.analysis module: bootstrap_metric_from_predictions, paired_diff_from_prediction_refs | Added | Optional: bootstrap / paired-diff from saved predictions. |
| New eval_toolkit.protocols module: Scorer, SliceAwareScorer, EvalSliceLike, PredictionReader, Versioned | Added | Pandas-free Protocol home. Re-exported from eval_toolkit and eval_toolkit.harness. |
| RunResult gained 6 optional fields | Added | Optional: populate to publish v0.9 evidence; defaults preserve v0.8 shape. |
| build_manifest gained source_roles, required_source_roles, prediction_artifacts kwargs | Added | Optional: declare data lineage and artifact provenance. |
| New manifest schema manifest.v1.json | Added | Optional: validate manifests with validate_payload. |
| results.v1.json / results_full.v1.json gained optional top-level fields | Changed (additive) | None; consumers tolerate via additionalProperties: true. |
| New [validation] extra ships jsonschema>=4.21 | Added | Install if you use validate_payload. |

1. New module map

eval_toolkit.claims

Generic evidence gates. They encode the conditions a claim must satisfy before you treat the metric report as supporting it. The module ships roughly a dozen gate constructors covering required-slice, required-metric, source-role, low-FPR-feasibility, strict-artifact, and external-diagnostic checks, among others.

Public surface: ClaimSpec, ClaimReport, EvidenceGate, GateResult, evaluate_claims, and the gate constructors.
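To make the gate idea concrete, here is a minimal stand-alone sketch of how a required-slice check could be modeled. It does not import eval_toolkit: SketchGateResult and sketch_required_slice_gate are hypothetical stand-ins, and the real gate constructors may differ in signature, naming, and message wording.

```python
from collections.abc import Callable, Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchGateResult:
    """Hypothetical stand-in for a gate outcome (not the library's GateResult)."""
    gate: str
    passed: bool
    message: str


# In this sketch a gate is just a function from a result payload to an outcome.
SketchGate = Callable[[Mapping[str, object]], SketchGateResult]


def sketch_required_slice_gate(slice_name: str) -> SketchGate:
    """Build a gate that passes only if `slice_name` appears under `by_slice`."""
    def check(payload: Mapping[str, object]) -> SketchGateResult:
        by_slice = payload.get("by_slice", {})
        present = isinstance(by_slice, Mapping) and slice_name in by_slice
        message = "ok" if present else f"missing slice: {slice_name}"
        return SketchGateResult(
            gate=f"required_slice[{slice_name}]", passed=present, message=message
        )
    return check


gate = sketch_required_slice_gate("locked_eval")
ok = gate({"by_slice": {"locked_eval": {}}})
missing = gate({"by_slice": {}})
print(ok.passed, missing.passed, missing.message)
```

Modeling a gate as a pure function of the payload means a failing gate is recorded as data rather than raised as an exception, which is the behavior the failure-path example in section 3 relies on.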

eval_toolkit.artifacts

Prediction-artifact references and payload validation. A PredictionArtifactRef declares: where the per-row prediction file lives (URI), what its bytes hash to (sha256), and how its columns map to the canonical (row_id, label, score, scorer, slice, …) contract. validate_payload(payload, schema_filename) runs the schema in eval_toolkit/schemas/ against the payload (requires pip install "eval-toolkit[validation]").
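The three things an artifact reference declares (location, content hash, column mapping) can be illustrated with the standard library alone. The sketch below is an assumption-laden stand-in: SketchArtifactRef, its field names, and the matches helper are hypothetical and are not the PredictionArtifactRef API.

```python
import hashlib
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large prediction files are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


@dataclass(frozen=True)
class SketchArtifactRef:
    """Hypothetical stand-in; field names are illustrative, not the library's."""
    uri: str
    sha256: str
    column_map: dict[str, str] = field(default_factory=dict)

    def matches(self, path: Path) -> bool:
        """True when the file's current bytes still hash to the recorded digest."""
        return sha256_of_file(path) == self.sha256


with tempfile.TemporaryDirectory() as tmp:
    pred_path = Path(tmp) / "predictions.csv"
    pred_path.write_text("id,y,p\n1,0,0.2\n2,1,0.9\n")
    ref = SketchArtifactRef(
        uri=pred_path.as_uri(),
        sha256=sha256_of_file(pred_path),
        # canonical column name -> column name used in this file
        column_map={"row_id": "id", "label": "y", "score": "p"},
    )
    unchanged = ref.matches(pred_path)
    pred_path.write_text("id,y,p\n1,0,0.2\n2,1,0.1\n")  # tamper with one score
    tampered = ref.matches(pred_path)

print(unchanged, tampered)
```

Recording the digest alongside the URI lets a downstream consumer detect that a prediction file was edited or regenerated after the reference was published.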

eval_toolkit.evidence

Three small dataclasses for declaring aggregated-evidence provenance: EvidenceAxis (one dimension of aggregation — slice, scorer, fold, …), PairingMetadata (was the bootstrap paired? what was the unit?), AggregateEvidence (the rolled-up status + method + axes).

eval_toolkit.operating_points

OperatingPointSpec declares a fit target (e.g. “min recall ≥ 0.7 subject to FPR ≤ 0.05”); fit_operating_points produces a FittedOperatingPoint (threshold + diagnostics); apply_operating_points applies the fitted threshold to a different slice’s scores. Enables principled cross-slice threshold transfer.
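The fit-then-apply idea can be sketched without the library. The example below picks the lowest threshold that satisfies an FPR cap on one slice, then measures recall and FPR at that same threshold on a second slice. The function names and toy data are hypothetical, and the selection rule is one plausible choice; fit_operating_points may use a different rule and return richer diagnostics.

```python
def fit_threshold_under_fpr(scores, labels, max_fpr):
    """Return the lowest threshold whose FPR on (scores, labels) is <= max_fpr.

    Rows are predicted positive when score >= threshold. FPR never increases
    as the threshold rises, so the lowest feasible threshold also gives the
    highest recall that the constraint allows.
    """
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not negatives:
        raise ValueError("need at least one negative row to estimate FPR")
    for threshold in sorted(set(scores)) + [float("inf")]:
        fpr = sum(s >= threshold for s in negatives) / len(negatives)
        if fpr <= max_fpr:
            return threshold
    raise ValueError("max_fpr must be non-negative")


def rates_at(scores, labels, threshold):
    """Return (recall, fpr) when predicting positive for score >= threshold."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    recall = sum(s >= threshold for s in pos) / len(pos)
    fpr = sum(s >= threshold for s in neg) / len(neg)
    return recall, fpr


# Slice used for fitting (toy data).
fit_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
fit_labels = [0, 0, 0, 1, 0, 1, 1, 1]
threshold = fit_threshold_under_fpr(fit_scores, fit_labels, max_fpr=0.25)
fit_recall, fit_fpr = rates_at(fit_scores, fit_labels, threshold)

# A different slice: the threshold is reused as-is, not re-fitted.
other_scores = [0.15, 0.35, 0.45, 0.5, 0.55, 0.85]
other_labels = [0, 1, 0, 0, 1, 1]
other_recall, other_fpr = rates_at(other_scores, other_labels, threshold)

print(threshold, (fit_recall, fit_fpr), (other_recall, other_fpr))
```

In this toy data the transferred threshold meets the FPR cap on the fitting slice but exceeds it on the second slice, which is why reporting the diagnostics at the applied threshold matters.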

eval_toolkit.analysis

Filesystem-aware helpers that bridge prediction artifacts and the bootstrap kernel: bootstrap_metric_from_predictions and paired_diff_from_prediction_refs. Use when you’ve stored predictions to disk and want to re-derive CIs without re-running scorers.
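As a rough illustration of re-deriving a confidence interval from stored predictions, the sketch below reads label/score rows from CSV text and runs a plain percentile bootstrap over rows. It is not the library's bootstrap kernel: load_predictions, accuracy, percentile_bootstrap_ci, the column names, and the in-memory "file" are all assumptions made for the example.

```python
import csv
import io
import random


def load_predictions(handle):
    """Read (label, score) pairs from a CSV with `label` and `score` columns."""
    return [(int(row["label"]), float(row["score"])) for row in csv.DictReader(handle)]


def accuracy(rows, threshold=0.5):
    """Fraction of rows where (score >= threshold) agrees with the label."""
    return sum((score >= threshold) == bool(label) for label, score in rows) / len(rows)


def percentile_bootstrap_ci(rows, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample rows with replacement, recompute the metric."""
    rng = random.Random(seed)
    stats = sorted(
        metric([rows[rng.randrange(len(rows))] for _ in rows])
        for _ in range(n_resamples)
    )
    lo = stats[int((alpha / 2) * (n_resamples - 1))]
    hi = stats[int((1 - alpha / 2) * (n_resamples - 1))]
    return lo, hi


# Stand-in for a predictions file written by an earlier run.
lines = ["label,score"]
for i in range(50):
    label = i % 2
    correct = i % 5 != 0  # every fifth row is deliberately misclassified
    score = 0.8 if (label == 1) == correct else 0.2
    lines.append(f"{label},{score}")
saved = io.StringIO("\n".join(lines))

rows = load_predictions(saved)
point = accuracy(rows)
lo, hi = percentile_bootstrap_ci(rows, accuracy)
print(point, lo, hi)
```

No scorer is invoked anywhere in the sketch: once per-row predictions are on disk, the interval depends only on the saved rows, the metric, and the resampling seed.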

eval_toolkit.protocols

Lightweight Protocols with zero runtime pandas dependency — the home for Scorer, SliceAwareScorer, EvalSliceLike, PredictionReader, Versioned. Pandas types appear in annotations only via TYPE_CHECKING. Re-exported from both eval_toolkit (top level) and eval_toolkit.harness for backward compatibility — consumer imports don’t change.
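The annotations-only pattern described above looks roughly like the following. This is a generic sketch of the technique, not the contents of protocols.py: SketchScorer and ConstantScorer are hypothetical names, and the annotation is written as a string so that it is never evaluated at runtime.

```python
from typing import TYPE_CHECKING, Protocol, runtime_checkable

if TYPE_CHECKING:
    # Seen by static type checkers only; this import never runs at runtime,
    # so the module loads even when pandas is not installed.
    import pandas as pd


@runtime_checkable
class SketchScorer(Protocol):
    """Hypothetical stand-in for a structural scorer interface."""

    def predict_proba(self, X: "list[str] | pd.Series") -> list[float]: ...


class ConstantScorer:
    """Satisfies the protocol structurally; no inheritance required."""

    def predict_proba(self, X):
        return [0.5 for _ in X]


scorer = ConstantScorer()
print(isinstance(scorer, SketchScorer))
```

Because the protocol is structural, existing scorer classes do not need to inherit from anything; note that the runtime isinstance check only verifies that the method exists, not that its signature matches.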

2. RunResult field additions

Six new fields, all optional, all default to empty containers:

| Field | Type | What it carries |
| --- | --- | --- |
| claim_report | dict[str, object] | Serialized ClaimReport.to_dict(). |
| prediction_artifacts | list[dict[str, object]] | List of PredictionArtifactRef.to_dict(). |
| evidence_axes | list[dict[str, object]] | List of EvidenceAxis.to_dict(). |
| pairing_metadata | dict[str, object] | PairingMetadata.to_dict(). |
| aggregate_evidence | dict[str, object] | AggregateEvidence.to_dict(). |
| threshold_policy | dict[str, object] | ThresholdPolicyMetadata.to_dict(). |

v0.8 construction still works — no kwargs needed:

from eval_toolkit.harness import RunResult
result = RunResult(run_id="r", git_sha=None, config={}, by_slice={})
assert result.claim_report == {}
assert result.prediction_artifacts == []

v0.9 enriched construction — pass any subset:

from eval_toolkit.evidence import EvidenceAxis
from eval_toolkit.harness import RunResult
result = RunResult(
    run_id="r",
    git_sha=None,
    config={},
    by_slice={},
    evidence_axes=[EvidenceAxis(name="slice", value="locked_eval").to_dict()],
)
assert result.evidence_axes[0]["name"] == "slice"

3. End-to-end walkthrough: extending a v0.8 harness with v0.9 evidence

This section shows a minimal v0.8 harness and how to extend it with v0.9 claim gates without rewriting the eval loop.

v0.8: bare metric report

A v0.8 harness scores one model on one slice and writes a result:

import numpy as np
import pandas as pd
from eval_toolkit import EvalSlice, evaluate

class TinyScorer:
    """Toy scorer that returns calibrated noise."""
    def predict_proba(self, X):
        rng = np.random.default_rng(0)
        return rng.uniform(0.2, 0.8, size=len(X))

df = pd.DataFrame({"text": [f"r{i}" for i in range(40)], "label": [0, 1] * 20})
slices = [EvalSlice(name="locked_eval", df=df)]
result = evaluate({"tiny": TinyScorer()}, slices, run_id="v08-demo")
assert "locked_eval" in result.by_slice

That’s a complete v0.8 run. result.by_slice["locked_eval"] carries the metrics. No claim evidence is attached — consumers eyeball the numbers.

v0.9: same harness + claim gates

The v0.9 upgrade keeps the eval loop unchanged. After evaluate, build a ClaimSpec listing the evidence the claim requires, run evaluate_claims, and attach the report:

import numpy as np
import pandas as pd
from eval_toolkit import EvalSlice, evaluate
from eval_toolkit.claims import (
    ClaimSpec, evaluate_claims,
    required_slice_gate, required_scorer_gate, minimum_slice_size_gate,
)
from eval_toolkit.harness import with_claim_report

class TinyScorer:
    def predict_proba(self, X):
        rng = np.random.default_rng(0)
        return rng.uniform(0.2, 0.8, size=len(X))

df = pd.DataFrame({"text": [f"r{i}" for i in range(40)], "label": [0, 1] * 20})
slices = [EvalSlice(name="locked_eval", df=df)]
result = evaluate({"tiny": TinyScorer()}, slices, run_id="v09-demo")

# v0.9 addition: declare the evidence the "tiny works on locked_eval"
# claim requires, then evaluate it against the result payload.
claim = ClaimSpec(
    name="tiny works on locked_eval",
    gates=(
        required_slice_gate("locked_eval"),
        required_scorer_gate("locked_eval", "tiny"),
        minimum_slice_size_gate("locked_eval", min_n=20, min_positive=10),
    ),
)
report = evaluate_claims(result.to_dict(), [claim])

# Frozen-by-value attachment — RunResult itself is immutable.
enriched = with_claim_report(result, report)

assert not report.has_failures()
assert enriched.claim_report["has_failures"] is False
assert len(enriched.claim_report["claims"]["tiny works on locked_eval"]) == 3

The harness loop didn’t change. The only new code is the ClaimSpec declaration, the evaluate_claims call, and the with_claim_report attachment.

Failure path: what a missing gate looks like

If the claim requires a slice that wasn’t produced, the report records the failing gate without aborting:

from eval_toolkit.claims import ClaimSpec, evaluate_claims, required_slice_gate
claim = ClaimSpec(name="ood claim", gates=(required_slice_gate("ood_unseen"),))
empty_result = {"by_slice": {}}
report = evaluate_claims(empty_result, [claim])
assert report.has_failures()
gate_result = report.claims["ood claim"][0]
assert gate_result.passed is False
assert "missing slice" in gate_result.message

This pattern lets your CI gate publication on report.has_failures() without sprinkling assertions throughout the eval code.
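One way to wire this into CI is to map the report to a process exit code in a single place. The sketch below relies only on the has_failures() method shown above; publication_exit_code and _StubReport are hypothetical names, and the stub exists only so the example runs without the library.

```python
def publication_exit_code(report) -> int:
    """Map a claim report to an exit code: 0 allows publication, 1 blocks it."""
    return 1 if report.has_failures() else 0


class _StubReport:
    """Hypothetical stand-in exposing only has_failures(), for illustration."""

    def __init__(self, failed: bool) -> None:
        self._failed = failed

    def has_failures(self) -> bool:
        return self._failed


# In a real pipeline, `report` would come from evaluate_claims(...).
blocked = publication_exit_code(_StubReport(failed=True))
allowed = publication_exit_code(_StubReport(failed=False))
print(blocked, allowed)
```

A CI entry point would typically pass the returned value to sys.exit so that a failing gate turns the job red without any assertion inside the eval loop.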

4. Scorer Protocol consolidation (v0.9.0 + v0.9.1)

v0.9.0 introduced eval_toolkit.protocols as the pandas-free home for Scorer and SliceAwareScorer but left duplicate copies in eval_toolkit.harness. v0.9.1 removes the duplicates; harness.py now imports from protocols.py. The top-level re-export from eval_toolkit import Scorer resolves to the same class as before:

from eval_toolkit import Scorer
from eval_toolkit.harness import Scorer as HarnessScorer
from eval_toolkit.protocols import Scorer as ProtocolsScorer
assert Scorer is HarnessScorer is ProtocolsScorer
assert Scorer.__module__ == "eval_toolkit.protocols"

The canonical Scorer.predict_proba signature still accepts list[str] | np.ndarray | pd.Series. In protocols.py, pandas is imported only under TYPE_CHECKING, so the module itself triggers no pandas import at runtime and can be imported even when pandas is not installed.

5. Schema additions

results.v1.json and results_full.v1.json gained these top-level optional fields, mirroring the RunResult additions in section 2: claim_report, prediction_artifacts, evidence_axes, pairing_metadata, aggregate_evidence, threshold_policy. The schemas keep the .v1 filename because every change is additive and the schemas declare additionalProperties: true — v0.8 consumers read v0.9 outputs without error.
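The forward-compatibility argument can be seen from the consumer side: a reader that only looks up the keys it knows is unaffected by extra top-level fields. In the sketch below, read_as_v08 and V08_KNOWN_KEYS are hypothetical, and the key list is an assumption borrowed from the RunResult constructor arguments shown in section 2 rather than taken from the schema file.

```python
import json

# Assumed v0.8-era top-level keys (illustrative; not read from results.v1.json).
V08_KNOWN_KEYS = ("run_id", "git_sha", "config", "by_slice")


def read_as_v08(raw: str) -> dict:
    """Read only the keys a v0.8-era consumer knows and ignore everything else."""
    payload = json.loads(raw)
    return {key: payload.get(key) for key in V08_KNOWN_KEYS}


v09_output = json.dumps({
    "run_id": "r",
    "git_sha": None,
    "config": {},
    "by_slice": {"locked_eval": {}},
    "claim_report": {"has_failures": False},  # added in v0.9
    "evidence_axes": [{"name": "slice", "value": "locked_eval"}],  # added in v0.9
})

view = read_as_v08(v09_output)
print(sorted(view))
```

A consumer written this way keeps working as long as existing keys keep their meaning, which is the condition under which the .v1 filename is retained.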

A new schema manifest.v1.json lands for validating RunManifest.to_dict() payloads. See methodology/versioning.md § Schema evolution policy for the policy that governs when a filename gets bumped to .v2.

6. The validation extra

jsonschema>=4.21 moved into a focused [validation] extra in v0.9. Install only if you use eval_toolkit.artifacts.validate_payload:

pip install "eval-toolkit[validation]"

validate_payload lazy-imports jsonschema; if it’s missing, it raises ImportError with a clear pointer to the extra. The [all] and [dev] extras include it transitively, so most users won’t need to install it explicitly.
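The lazy-import behavior described here follows a common pattern, sketched below with the standard library. require_optional is a hypothetical helper, not a function in eval_toolkit, and the exact wording of the library's error message is not reproduced.

```python
import importlib


def require_optional(module_name: str, extra: str):
    """Import `module_name` on first use, or raise ImportError naming the extra."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is required for this feature; "
            f'install it with: pip install "eval-toolkit[{extra}]"'
        ) from exc


# A module that is always available imports normally.
json_module = require_optional("json", "validation")

# A missing module produces an error that points at the extra.
try:
    require_optional("surely_not_installed_module_xyz", "validation")
except ImportError as exc:
    hint = str(exc)

print(hint)
```

Deferring the import to the call site keeps the dependency out of the import path of everything else, so only callers of the validating function ever need the extra installed.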

7. Pitfalls / common mistakes

  • Forgetting with_claim_report: RunResult is frozen. Mutating result.claim_report directly is impossible; you must use with_claim_report(result, report) to get a new RunResult with the evidence attached.

  • Assuming claim_report is auto-populated: running an evaluation does not evaluate claims automatically. You must explicitly call evaluate_claims(...) and attach the result. The opt-in is deliberate: claims are caller-defined and the harness shouldn’t guess them.

  • Confusing evaluate_claims argument order: the signature is evaluate_claims(result, claim_specs, *, manifest=None) — result first, specs second. Not (specs, result).

  • Treating additionalProperties: true as “unstable”: it’s the forward-compat contract. v0.8 consumers tolerate v0.9’s new fields; v0.9 consumers tolerate hypothetical future additions under the same .v1.json filename. Filename bumps signal real breakage; field additions don’t.

  • Using jsonschema without the extra: validate_payload raises ImportError with installation guidance. The lazy import is intentional — core scientific use cases don’t need it.
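The first pitfall is easiest to see on a small frozen dataclass. The sketch below assumes the frozen behavior resembles a standard frozen dataclass and uses dataclasses.replace to build the enriched copy; SketchRunResult and sketch_with_claim_report are hypothetical stand-ins, and with_claim_report may be implemented differently.

```python
from dataclasses import FrozenInstanceError, dataclass, field, replace


@dataclass(frozen=True)
class SketchRunResult:
    """Hypothetical stand-in for a frozen result record."""
    run_id: str
    claim_report: dict = field(default_factory=dict)


def sketch_with_claim_report(result: SketchRunResult, report: dict) -> SketchRunResult:
    """Return a copy with claim_report set; the input object is left untouched."""
    return replace(result, claim_report=dict(report))


original = SketchRunResult(run_id="r")

# Attribute assignment on a frozen dataclass raises instead of mutating.
try:
    original.claim_report = {"has_failures": False}
    assignment_blocked = False
except FrozenInstanceError:
    assignment_blocked = True

enriched = sketch_with_claim_report(original, {"has_failures": False})
print(assignment_blocked, original.claim_report, enriched.claim_report)
```

The copy-on-attach style keeps the original run record usable as an unmodified baseline, for example when several claim reports are evaluated against the same result.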

See also