v0.8.x → v0.9.0 migration#
v0.9.0 is an additive feature release introducing the evidence
core: six new public modules (claims, artifacts, evidence,
operating_points, analysis, protocols) and six new optional
RunResult fields. No v0.8 public API was removed, renamed, or
behavior-changed. Existing v0.8 harness code keeps working unchanged;
v0.9 features are opt-in.
This guide covers what’s new, when to adopt each piece, and how to extend a v0.8 harness with v0.9 evidence gates.
At a glance#
| Change | Type | Action |
|---|---|---|
| New eval_toolkit.claims | Added | Optional: adopt to encode pass/fail evidence per claim. |
| New eval_toolkit.artifacts | Added | Optional: serialize prediction artifacts for downstream consumers. |
| New eval_toolkit.evidence | Added | Optional: declare how aggregated evidence was produced. |
| New eval_toolkit.operating_points | Added | Optional: transfer a fitted operating point across slices. |
| New eval_toolkit.analysis | Added | Optional: bootstrap / paired-diff from saved predictions. |
| New eval_toolkit.protocols | Added | Pandas-free Protocol home. Re-exported from eval_toolkit and eval_toolkit.harness. |
| RunResult (six new optional fields) | Added | Optional: populate to publish v0.9 evidence; defaults preserve v0.8 shape. |
|  | Added | Optional: declare data lineage and artifact provenance. |
| New manifest schema manifest.v1.json | Added | Optional: validate manifests with validate_payload. |
| results.v1.json, results_full.v1.json | Changed (additive) | None; consumers tolerate via additionalProperties: true. |
| New [validation] extra | Added | Install if you use validate_payload. |
1. New module map#
eval_toolkit.claims#
Generic evidence gates. Encode the conditions a claim must satisfy before you treat the metric report as supporting it. Ships ~12 gate constructors covering required-slice / required-metric / source-role / low-FPR-feasibility / strict-artifact / external-diagnostic / etc.
Public surface: ClaimSpec, ClaimReport, EvidenceGate,
GateResult, evaluate_claims, and the gate constructors.
eval_toolkit.artifacts#
Prediction-artifact references and payload validation. A
PredictionArtifactRef declares: where the per-row prediction file
lives (URI), what its bytes hash to (sha256), and how its columns
map to the canonical (row_id, label, score, scorer, slice, …)
contract. validate_payload(payload, schema_filename) runs the
schema in eval_toolkit/schemas/ against the payload (requires
pip install "eval-toolkit[validation]").
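As a rough illustration of the hash part of that contract, the sketch below streams a file through SHA-256 and compares the digest with a declared value. It uses only the standard library. file_sha256 and matches_declared_hash are hypothetical helper names for this example, not part of eval_toolkit, and the toolkit's own verification may work differently.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_declared_hash(path: Path, declared_sha256: str) -> bool:
    """Compare the file's digest with the declared one, ignoring hex case."""
    return file_sha256(path) == declared_sha256.lower()


# Hypothetical per-row prediction file, written to a temp dir for the demo.
content = b"row_id,label,score\n0,1,0.9\n"
with tempfile.TemporaryDirectory() as tmp:
    predictions = Path(tmp) / "predictions.csv"
    predictions.write_bytes(content)
    declared = hashlib.sha256(content).hexdigest()
    hash_ok = matches_declared_hash(predictions, declared)
    tampered_ok = matches_declared_hash(predictions, "0" * 64)
```

Checking the digest before reading any rows means a consumer can reject a swapped or truncated file without parsing it.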
eval_toolkit.evidence#
Three small dataclasses for declaring aggregated-evidence provenance:
EvidenceAxis (one dimension of aggregation — slice, scorer, fold,
…), PairingMetadata (was the bootstrap paired? what was the unit?),
AggregateEvidence (the rolled-up status + method + axes).
eval_toolkit.operating_points#
OperatingPointSpec declares a fit target (e.g. “min recall ≥ 0.7
subject to FPR ≤ 0.05”); fit_operating_points produces a
FittedOperatingPoint (threshold + diagnostics);
apply_operating_points applies the fitted threshold to a different
slice’s scores. Enables principled cross-slice threshold transfer.
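To make the idea of threshold transfer concrete, here is a minimal pure-Python sketch, assuming a "score >= threshold means positive" convention. fit_threshold_at_fpr, apply_threshold, and recall_and_fpr are hypothetical names for this illustration. They are not the signatures of fit_operating_points or apply_operating_points, which this guide does not spell out.

```python
from typing import Sequence


def fit_threshold_at_fpr(
    scores: Sequence[float], labels: Sequence[int], max_fpr: float
) -> float:
    """Pick the lowest threshold whose FPR on this slice stays <= max_fpr.

    Lowering the threshold can only raise recall, so the lowest feasible
    threshold is the one with the highest recall under the FPR cap.
    """
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not negatives:
        raise ValueError("need at least one negative to estimate FPR")
    best = float("inf")  # predicts nothing positive, so FPR is 0
    for candidate in sorted(set(scores), reverse=True):
        false_positives = sum(1 for s in negatives if s >= candidate)
        if false_positives / len(negatives) > max_fpr:
            break  # FPR only grows as the threshold drops further
        best = candidate
    return best


def apply_threshold(scores: Sequence[float], threshold: float) -> list:
    """Binarize scores with an already-fitted threshold."""
    return [int(s >= threshold) for s in scores]


def recall_and_fpr(preds: Sequence[int], labels: Sequence[int]) -> tuple:
    """Return (recall, FPR); assumes both classes are present."""
    positives = sum(labels)
    negatives = len(labels) - positives
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    return tp / positives, fp / negatives


# Fit on one slice...
fit_scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05]
fit_labels = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
threshold = fit_threshold_at_fpr(fit_scores, fit_labels, max_fpr=0.2)

# ...then reuse the same threshold, unchanged, on a different slice.
other_scores = [0.85, 0.65, 0.55, 0.5, 0.3, 0.75]
other_labels = [1, 1, 1, 0, 0, 0]
other_preds = apply_threshold(other_scores, threshold)
other_recall, other_fpr = recall_and_fpr(other_preds, other_labels)
```

In this toy data the transferred threshold exceeds the FPR cap on the second slice, which is exactly the kind of diagnostic worth recording when a threshold is fitted on one slice and applied to another.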
eval_toolkit.analysis#
Filesystem-aware helpers that bridge prediction artifacts and the
bootstrap kernel: bootstrap_metric_from_predictions and
paired_diff_from_prediction_refs. Use when you’ve stored predictions
to disk and want to re-derive CIs without re-running scorers.
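The sketch below shows the general technique behind that workflow: reload per-row predictions from disk and run a paired percentile bootstrap on the accuracy difference between two scorers, without calling either scorer again. It is a standard-library illustration under assumed column names (row_id, label, score). Every function here is hypothetical and none of it reproduces the signatures of bootstrap_metric_from_predictions or paired_diff_from_prediction_refs.

```python
import csv
import random
import tempfile
from pathlib import Path


def write_predictions(path, rows):
    """Write (row_id, label, score) tuples to a CSV file."""
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["row_id", "label", "score"])
        writer.writerows(rows)


def read_predictions(path):
    """Load {row_id: (label, score)} from a CSV with those three columns."""
    with path.open(newline="") as handle:
        return {
            row["row_id"]: (int(row["label"]), float(row["score"]))
            for row in csv.DictReader(handle)
        }


def accuracy(pairs, threshold=0.5):
    """Fraction of rows where the thresholded score matches the label."""
    hits = sum(int(score >= threshold) == label for label, score in pairs)
    return hits / len(pairs)


def percentile_interval(samples, alpha=0.05):
    """Simple percentile interval over the bootstrap samples."""
    ordered = sorted(samples)
    last = len(ordered) - 1
    return ordered[int(alpha / 2 * last)], ordered[int((1 - alpha / 2) * last)]


def paired_bootstrap_diff(preds_a, preds_b, n_resamples=500, seed=0):
    """Bootstrap accuracy(A) - accuracy(B), resampling shared rows jointly."""
    row_ids = sorted(preds_a.keys() & preds_b.keys())
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # The same resampled rows feed both scorers: that is the pairing.
        sample = rng.choices(row_ids, k=len(row_ids))
        acc_a = accuracy([preds_a[r] for r in sample])
        acc_b = accuracy([preds_b[r] for r in sample])
        diffs.append(acc_a - acc_b)
    return percentile_interval(diffs)


with tempfile.TemporaryDirectory() as tmp:
    path_a = Path(tmp) / "scorer_a.csv"
    path_b = Path(tmp) / "scorer_b.csv"
    labels = [i % 2 for i in range(20)]
    # Scorer A is right on every row; scorer B is wrong on every fifth row.
    write_predictions(path_a, [(i, y, 0.9 if y else 0.1) for i, y in enumerate(labels)])
    write_predictions(
        path_b,
        [
            (i, y, (0.1 if y else 0.9) if i % 5 == 0 else (0.9 if y else 0.1))
            for i, y in enumerate(labels)
        ],
    )
    ci_low, ci_high = paired_bootstrap_diff(
        read_predictions(path_a), read_predictions(path_b)
    )
```

Resampling row identifiers once per replicate and reusing them for both scorers keeps the comparison paired, which is usually tighter than bootstrapping each scorer independently.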
eval_toolkit.protocols#
Lightweight Protocols with zero runtime pandas dependency — the
home for Scorer, SliceAwareScorer, EvalSliceLike,
PredictionReader, Versioned. Pandas types appear in annotations
only via TYPE_CHECKING. Re-exported from both
eval_toolkit (top level) and eval_toolkit.harness for backward
compatibility — consumer imports don’t change.
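The annotation-only import pattern described above can be sketched as follows. ScorerLike and ConstantScorer are hypothetical stand-ins written for this example, not the toolkit's Scorer class; the point is only that the numpy and pandas names inside the TYPE_CHECKING block are never imported at runtime.

```python
from typing import TYPE_CHECKING, Protocol, runtime_checkable

if TYPE_CHECKING:
    # Evaluated by static type checkers only; skipped at runtime.
    import numpy as np
    import pandas as pd


@runtime_checkable
class ScorerLike(Protocol):
    """Hypothetical structural type: anything with predict_proba(X)."""

    # The string annotation is never evaluated at runtime, so the
    # np / pd names do not need to exist when this module is imported.
    def predict_proba(self, X: "list[str] | np.ndarray | pd.Series"): ...


class ConstantScorer:
    """Satisfies the protocol structurally, with no inheritance."""

    def predict_proba(self, X):
        return [0.5 for _ in X]


scorer = ConstantScorer()
is_scorer = isinstance(scorer, ScorerLike)
scores = scorer.predict_proba(["a", "b"])
```

Because the check is structural, existing scorer classes need no base class or registration to satisfy a Protocol like this one.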
2. RunResult field additions#
Six new fields, all optional, all default to empty containers:
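| Field | Type | What it carries |
|---|---|---|
| claim_report | dict | Serialized ClaimReport |
| prediction_artifacts | list | List of serialized PredictionArtifactRef entries |
| evidence_axes | list | List of EvidenceAxis.to_dict() payloads |
| pairing_metadata |  |  |
| aggregate_evidence |  |  |
| threshold_policy |  |  |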
v0.8 construction still works — no kwargs needed:
from eval_toolkit.harness import RunResult
result = RunResult(run_id="r", git_sha=None, config={}, by_slice={})
assert result.claim_report == {}
assert result.prediction_artifacts == []
v0.9 enriched construction — pass any subset:
from eval_toolkit.evidence import EvidenceAxis
from eval_toolkit.harness import RunResult
result = RunResult(
    run_id="r",
    git_sha=None,
    config={},
    by_slice={},
    evidence_axes=[EvidenceAxis(name="slice", value="locked_eval").to_dict()],
)
assert result.evidence_axes[0]["name"] == "slice"
3. End-to-end walkthrough — extending a v0.8 harness with v0.9 evidence#
This section shows a minimal v0.8 harness and how to extend it with v0.9 claim gates without rewriting the eval loop.
v0.8: bare metric report#
A v0.8 harness scores one model on one slice and writes a result:
import numpy as np
import pandas as pd
from eval_toolkit import EvalSlice, evaluate
class TinyScorer:
    """Toy scorer that returns seeded uniform noise in [0.2, 0.8)."""

    def predict_proba(self, X):
        rng = np.random.default_rng(0)
        return rng.uniform(0.2, 0.8, size=len(X))
df = pd.DataFrame({"text": [f"r{i}" for i in range(40)], "label": [0, 1] * 20})
slices = [EvalSlice(name="locked_eval", df=df)]
result = evaluate({"tiny": TinyScorer()}, slices, run_id="v08-demo")
assert "locked_eval" in result.by_slice
That’s a complete v0.8 run. result.by_slice["locked_eval"] carries
the metrics. No claim evidence is attached — consumers eyeball the
numbers.
v0.9: same harness + claim gates#
The v0.9 upgrade keeps the eval loop unchanged. After evaluate,
build a ClaimSpec listing the evidence the claim requires, run
evaluate_claims, and attach the report:
import numpy as np
import pandas as pd
from eval_toolkit import EvalSlice, evaluate
from eval_toolkit.claims import (
    ClaimSpec, evaluate_claims,
    required_slice_gate, required_scorer_gate, minimum_slice_size_gate,
)
from eval_toolkit.harness import with_claim_report
class TinyScorer:
    def predict_proba(self, X):
        rng = np.random.default_rng(0)
        return rng.uniform(0.2, 0.8, size=len(X))
df = pd.DataFrame({"text": [f"r{i}" for i in range(40)], "label": [0, 1] * 20})
slices = [EvalSlice(name="locked_eval", df=df)]
result = evaluate({"tiny": TinyScorer()}, slices, run_id="v09-demo")
# v0.9 addition: declare the evidence the "tiny works on locked_eval"
# claim requires, then evaluate it against the result payload.
claim = ClaimSpec(
    name="tiny works on locked_eval",
    gates=(
        required_slice_gate("locked_eval"),
        required_scorer_gate("locked_eval", "tiny"),
        minimum_slice_size_gate("locked_eval", min_n=20, min_positive=10),
    ),
)
report = evaluate_claims(result.to_dict(), [claim])
# RunResult is immutable, so with_claim_report returns a new RunResult with the report attached.
enriched = with_claim_report(result, report)
assert not report.has_failures()
assert enriched.claim_report["has_failures"] is False
assert len(enriched.claim_report["claims"]["tiny works on locked_eval"]) == 3
The harness loop didn’t change. The only new code is the ClaimSpec
declaration, the evaluate_claims call, and the
with_claim_report attachment.
Failure path: what a missing gate looks like#
If the claim requires a slice that wasn’t produced, the report records the failing gate without aborting:
from eval_toolkit.claims import ClaimSpec, evaluate_claims, required_slice_gate
claim = ClaimSpec(name="ood claim", gates=(required_slice_gate("ood_unseen"),))
empty_result = {"by_slice": {}}
report = evaluate_claims(empty_result, [claim])
assert report.has_failures()
gate_result = report.claims["ood claim"][0]
assert gate_result.passed is False
assert "missing slice" in gate_result.message
This pattern lets your CI gate publication on report.has_failures()
without sprinkling assertions throughout the eval code.
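One way to wire that into CI is a small script that reads a serialized result and turns the report into a process exit code. This is a sketch under assumptions: it relies on the serialized claim_report carrying a has_failures key, as in the walkthrough above, and exit_code_for and main are hypothetical names, as is whatever path you pass in.

```python
import json
import sys


def exit_code_for(results: dict) -> int:
    """Return 1 when the attached claim report records a failure, else 0.

    A missing or empty claim_report yields 0 here. That is a policy choice
    for this sketch; a stricter pipeline could treat it as a failure.
    """
    claim_report = results.get("claim_report") or {}
    return 1 if claim_report.get("has_failures", False) else 0


def main(path: str) -> None:
    """Load a results JSON file and exit non-zero if any claim failed."""
    with open(path) as handle:
        sys.exit(exit_code_for(json.load(handle)))


# In-memory payloads standing in for serialized results.
passing = {"by_slice": {}, "claim_report": {"has_failures": False, "claims": {}}}
failing = {"by_slice": {}, "claim_report": {"has_failures": True, "claims": {}}}
passing_code = exit_code_for(passing)
failing_code = exit_code_for(failing)
```

Keeping the decision in one function means the eval code stays free of assertions, and the publish step depends on a single exit status.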
4. Scorer Protocol consolidation (v0.9.0 + v0.9.1)#
v0.9.0 introduced eval_toolkit.protocols as the pandas-free home
for Scorer and SliceAwareScorer but left duplicate copies in
eval_toolkit.harness. v0.9.1 removes the duplicates;
harness.py now imports from protocols.py. The top-level re-export
from eval_toolkit import Scorer resolves to the same class as before:
from eval_toolkit import Scorer
from eval_toolkit.harness import Scorer as HarnessScorer
from eval_toolkit.protocols import Scorer as ProtocolsScorer
assert Scorer is HarnessScorer is ProtocolsScorer
assert Scorer.__module__ == "eval_toolkit.protocols"
The canonical Scorer.predict_proba signature still accepts
list[str] | np.ndarray | pd.Series. In protocols.py, pandas is imported
only under TYPE_CHECKING, so the module has no runtime pandas dependency
and can be imported even when pandas is not installed.
5. Schema additions#
results.v1.json and results_full.v1.json gained these top-level
optional fields, mirroring the RunResult additions in section 2:
claim_report, prediction_artifacts, evidence_axes,
pairing_metadata, aggregate_evidence, threshold_policy. The
schemas keep the .v1 filename because every change is additive and
the schemas declare additionalProperties: true — v0.8 consumers
read v0.9 outputs without error.
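The tolerance described above comes down to consumers reading the keys they know and ignoring the rest. The sketch below illustrates that with plain dictionaries. The field names in V08_KNOWN_FIELDS are taken from the RunResult constructor shown earlier and are an assumption about the payload layout; read_v08_view is a hypothetical helper, not toolkit API.

```python
# Assumed v0.8-era top-level fields (from the RunResult constructor above).
V08_KNOWN_FIELDS = ("run_id", "git_sha", "config", "by_slice")


def read_v08_view(payload: dict) -> dict:
    """Keep only the fields a v0.8-era consumer knows about.

    Unknown top-level keys are skipped rather than rejected, which is the
    behavior that additionalProperties: true permits at the schema level.
    """
    return {key: payload.get(key) for key in V08_KNOWN_FIELDS}


# A v0.9-shaped payload with two of the new optional fields populated.
v09_payload = {
    "run_id": "v09-demo",
    "git_sha": None,
    "config": {},
    "by_slice": {"locked_eval": {}},
    "claim_report": {"has_failures": False, "claims": {}},
    "evidence_axes": [{"name": "slice", "value": "locked_eval"}],
}
v08_view = read_v08_view(v09_payload)
```

A consumer written this way keeps working as new optional fields are added under the same schema filename.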
A new schema manifest.v1.json lands for validating
RunManifest.to_dict() payloads. See
methodology/versioning.md § Schema evolution policy
for the policy that governs when a filename gets bumped to .v2.
6. The validation extra#
jsonschema>=4.21 moved into a focused [validation] extra in
v0.9. Install only if you use eval_toolkit.artifacts.validate_payload:
pip install "eval-toolkit[validation]"
validate_payload lazy-imports jsonschema; if it’s missing, it
raises ImportError with a clear pointer to the extra. The [all]
and [dev] extras include it transitively, so most users won’t need
to install it explicitly.
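For readers who have not seen the lazy-import pattern, a minimal version looks like the sketch below. require_optional is a hypothetical helper written for this example and is not how validate_payload is necessarily implemented; the error wording is likewise illustrative.

```python
import importlib


def require_optional(module_name: str, extra: str):
    """Import an optional dependency, or raise ImportError naming the extra."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is required for this feature; "
            f'install it with: pip install "eval-toolkit[{extra}]"'
        ) from exc


# A module that is present imports normally (json is in the stdlib).
json_module = require_optional("json", "validation")

# A module that is absent produces an actionable message instead of a bare
# ModuleNotFoundError. The module name below is deliberately made up.
message = ""
try:
    require_optional("surely_missing_module_for_demo", "validation")
except ImportError as exc:
    message = str(exc)
```

Deferring the import to the call site keeps the base install light while still giving users a direct pointer to the fix when they reach the optional feature.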
7. Pitfalls / common mistakes#
- Forgetting with_claim_report: RunResult is frozen. Mutating result.claim_report directly is impossible; you must use with_claim_report(result, report) to get a new RunResult with the evidence attached.
- Assuming claim_report is auto-populated: evaluation does not auto-evaluate claims. You must explicitly call evaluate_claims(...) and attach the result. The opt-in is deliberate: claims are caller-defined and the harness shouldn’t guess them.
- Confusing evaluate_claims argument order: the signature is evaluate_claims(result, claim_specs, *, manifest=None), with the result first and the specs second, not (specs, result).
- Treating additionalProperties: true as “unstable”: it’s the forward-compat contract. v0.8 consumers tolerate v0.9’s new fields; v0.9 consumers tolerate hypothetical future additions under the same .v1.json filename. Filename bumps signal real breakage; field additions don’t.
- Using jsonschema without the extra: validate_payload raises ImportError with installation guidance. The lazy import is intentional: core scientific use cases don’t need it.
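The first pitfall is easier to remember with the underlying mechanics in view. The sketch below uses a hypothetical FrozenResult dataclass as a stand-in, assuming RunResult behaves like a frozen dataclass; with_report is likewise an illustration and not the implementation of with_claim_report.

```python
from dataclasses import FrozenInstanceError, dataclass, field, replace


@dataclass(frozen=True)
class FrozenResult:
    """Hypothetical stand-in for an immutable result record."""

    run_id: str
    # Optional field with an empty-container default, so older call sites
    # that never pass it keep working.
    claim_report: dict = field(default_factory=dict)


def with_report(result: FrozenResult, report: dict) -> FrozenResult:
    """Return a copy of the result with the report attached."""
    return replace(result, claim_report=dict(report))


original = FrozenResult(run_id="r")

# Direct assignment on a frozen dataclass raises instead of mutating.
try:
    original.claim_report = {"has_failures": False}
    mutation_blocked = False
except FrozenInstanceError:
    mutation_blocked = True

# The copy carries the report; the original is left untouched.
enriched = with_report(original, {"has_failures": False})
```

The practical consequence is that code must hold on to the returned object; the variable passed in never gains the report.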
See also#
- CHANGELOG.md [0.9.0] and [0.9.1] blocks for the per-version release notes.
- methodology/evidence.md for the source-role / claim-gate methodology these primitives encode.
- methodology/versioning.md § Schema evolution policy: when schema filenames bump vs. stay.
- migration/v0.8.md for the v0.7 → v0.8 step (read first if you’re upgrading from v0.7.x).
- migration/v0.7.md for the v0.6 → v0.7 step.