# v0.8.x → v0.9.0 migration

v0.9.0 is **an additive feature release** introducing the *evidence
core*: six new public modules (`claims`, `artifacts`, `evidence`,
`operating_points`, `analysis`, `protocols`) and six new optional
`RunResult` fields. No v0.8 public API was removed, renamed, or
changed in behavior. Existing v0.8 harness code keeps working
unchanged; v0.9 features are opt-in.

This guide covers what's new, when to adopt each piece, and how to
extend a v0.8 harness with v0.9 evidence gates.

## At a glance

| Change | Type | Action |
|---|---|---|
| New `eval_toolkit.claims` module — `ClaimSpec`, `EvidenceGate`, `evaluate_claims` | Added | Optional: adopt to encode pass/fail evidence per claim. |
| New `eval_toolkit.artifacts` module — `PredictionArtifactRef`, `validate_payload` | Added | Optional: serialize prediction artifacts for downstream consumers. |
| New `eval_toolkit.evidence` module — `EvidenceAxis`, `PairingMetadata`, `AggregateEvidence` | Added | Optional: declare how aggregated evidence was produced. |
| New `eval_toolkit.operating_points` module — `OperatingPointSpec`, `fit_operating_points`, `apply_operating_points` | Added | Optional: transfer a fitted operating point across slices. |
| New `eval_toolkit.analysis` module — `bootstrap_metric_from_predictions`, `paired_diff_from_prediction_refs` | Added | Optional: bootstrap / paired-diff from saved predictions. |
| New `eval_toolkit.protocols` module — `Scorer`, `SliceAwareScorer`, `EvalSliceLike`, `PredictionReader`, `Versioned` | Added | Pandas-free Protocol home. Re-exported from `eval_toolkit` and `eval_toolkit.harness`. |
| `RunResult` gained 6 optional fields | Added | Optional: populate to publish v0.9 evidence; defaults preserve v0.8 shape. |
| `build_manifest` gained `source_roles`, `required_source_roles`, `prediction_artifacts` kwargs | Added | Optional: declare data lineage and artifact provenance. |
| New manifest schema `manifest.v1.json` | Added | Optional: validate manifests with `validate_payload`. |
| `results.v1.json` / `results_full.v1.json` gained optional top-level fields | Changed (additive) | None: consumers tolerate the new fields via `additionalProperties: true`. |
| New `[validation]` extra ships `jsonschema>=4.21` | Added | Install if you use `validate_payload`. |

## 1. New module map

### `eval_toolkit.claims`

Generic evidence gates. Encode the conditions a claim must satisfy
before you treat the metric report as supporting it. Ships about 12
gate constructors covering checks such as required slice, required
metric, source role, low-FPR feasibility, strict artifact, and
external diagnostic.

Public surface: `ClaimSpec`, `ClaimReport`, `EvidenceGate`,
`GateResult`, `evaluate_claims`, and the gate constructors.
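
To make the gate idea concrete, here is a minimal, library-independent sketch of the pattern: a gate is modeled as a function from a result payload to a pass/fail outcome with a message. The names `MiniGateResult`, `mini_required_slice_gate`, and `run_gates` are hypothetical stand-ins, not the `eval_toolkit.claims` API; the real constructors and their signatures may differ.

```python
from dataclasses import dataclass
from typing import Callable, Mapping


@dataclass(frozen=True)
class MiniGateResult:
    """Hypothetical stand-in for a gate outcome (not the library's GateResult)."""
    gate: str
    passed: bool
    message: str


# In this sketch a gate is a plain callable over the result payload.
MiniGate = Callable[[Mapping[str, object]], MiniGateResult]


def mini_required_slice_gate(slice_name: str) -> MiniGate:
    """Build a gate that passes only if `slice_name` is present in by_slice."""
    def check(payload: Mapping[str, object]) -> MiniGateResult:
        by_slice = payload.get("by_slice", {})
        present = slice_name in by_slice
        message = "ok" if present else f"missing slice: {slice_name}"
        return MiniGateResult(f"required_slice[{slice_name}]", present, message)
    return check


def run_gates(payload, gates):
    """Evaluate every gate and collect the outcomes instead of raising."""
    return [gate(payload) for gate in gates]


outcomes = run_gates(
    {"by_slice": {"locked_eval": {}}},
    [mini_required_slice_gate("locked_eval"), mini_required_slice_gate("ood_unseen")],
)
```

Collecting outcomes rather than raising on the first failure is what lets a report list every unmet condition at once.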

### `eval_toolkit.artifacts`

Prediction-artifact references and payload validation. A
`PredictionArtifactRef` declares: where the per-row prediction file
lives (URI), what its bytes hash to (`sha256`), and how its columns
map to the canonical `(row_id, label, score, scorer, slice, …)`
contract. `validate_payload(payload, schema_filename)` runs the
schema in `eval_toolkit/schemas/` against the payload (requires
`pip install "eval-toolkit[validation]"`).
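
The `sha256` field is what lets a downstream consumer detect that a prediction file changed after it was referenced. As a sketch of that check using only the standard library (the helper names `sha256_of_file` and `matches_declared_hash` are hypothetical and are not part of `eval_toolkit.artifacts`):

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_declared_hash(path: Path, declared_sha256: str) -> bool:
    """Check a file on disk against the hash recorded with its reference."""
    return sha256_of_file(path) == declared_sha256.lower()


with tempfile.TemporaryDirectory() as tmp:
    pred_path = Path(tmp) / "predictions.csv"
    pred_path.write_bytes(b"row_id,label,score\n0,1,0.9\n")
    recorded = sha256_of_file(pred_path)          # hash stored at reference time
    unchanged = matches_declared_hash(pred_path, recorded)
    pred_path.write_bytes(b"row_id,label,score\n0,1,0.1\n")  # file edited later
    changed = matches_declared_hash(pred_path, recorded)
```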

### `eval_toolkit.evidence`

Three small dataclasses for declaring aggregated-evidence provenance:
`EvidenceAxis` (one dimension of aggregation — slice, scorer, fold,
…), `PairingMetadata` (was the bootstrap paired? what was the unit?),
`AggregateEvidence` (the rolled-up status + method + axes).

### `eval_toolkit.operating_points`

`OperatingPointSpec` declares a fit target (e.g. "recall ≥ 0.7
subject to FPR ≤ 0.05"); `fit_operating_points` produces a
`FittedOperatingPoint` (threshold + diagnostics);
`apply_operating_points` applies the fitted threshold to a different
slice's scores. Enables principled cross-slice threshold transfer.
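
The fit-then-transfer idea can be illustrated without the library. The sketch below assumes a simple rule (predict positive when `score >= threshold`, and choose the lowest threshold whose false-positive rate stays under a cap); `fit_threshold` and `apply_threshold` are hypothetical helpers, and the actual fitting logic in `eval_toolkit.operating_points` may differ.

```python
def fit_threshold(scores, labels, max_fpr):
    """Pick the lowest candidate threshold whose FPR stays within max_fpr.

    Predictions are positive when score >= threshold, so lowering the
    threshold can only raise recall; the lowest feasible threshold
    therefore maximizes recall under the FPR cap.
    """
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    # +inf predicts nothing positive, so it is always feasible for max_fpr >= 0.
    candidates = sorted(set(scores)) + [float("inf")]
    for threshold in candidates:
        false_pos = sum(s >= threshold for s in negatives)
        fpr = false_pos / len(negatives) if negatives else 0.0
        if fpr <= max_fpr:
            return threshold
    raise ValueError("max_fpr must be non-negative")


def apply_threshold(scores, labels, threshold):
    """Report recall and FPR of a fixed threshold on another slice."""
    preds = [s >= threshold for s in scores]
    tp = sum(p and y == 1 for p, y in zip(preds, labels))
    fp = sum(p and y == 0 for p, y in zip(preds, labels))
    n_pos = sum(y == 1 for y in labels)
    n_neg = sum(y == 0 for y in labels)
    return {
        "recall": tp / n_pos if n_pos else float("nan"),
        "fpr": fp / n_neg if n_neg else float("nan"),
    }


# Fit on one slice, then reuse the threshold unchanged on a second slice.
threshold = fit_threshold(
    scores=[0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9],
    labels=[0, 0, 0, 1, 0, 1, 1, 1],
    max_fpr=0.25,
)
transferred = apply_threshold(
    scores=[0.05, 0.35, 0.45, 0.5, 0.95],
    labels=[0, 1, 0, 1, 1],
    threshold=threshold,
)
```

In this toy data the FPR cap holds on the fitting slice but not on the second slice, which is exactly the kind of gap that reporting the transferred metrics makes visible.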

### `eval_toolkit.analysis`

Filesystem-aware helpers that bridge prediction artifacts and the
bootstrap kernel: `bootstrap_metric_from_predictions` and
`paired_diff_from_prediction_refs`. Use when you've stored predictions
to disk and want to re-derive CIs without re-running scorers.
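
For intuition about what a paired difference over saved predictions involves, here is a standard-library sketch of a percentile bootstrap on accuracy, resampling rows jointly so both scorers see the same resampled rows. `accuracy` and `paired_bootstrap_diff` are hypothetical helpers working on in-memory lists; they are not the signatures of the `eval_toolkit.analysis` functions, which read from prediction references.

```python
import random


def accuracy(preds, labels):
    """Fraction of rows where the prediction equals the label."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def paired_bootstrap_diff(preds_a, preds_b, labels,
                          n_resamples=2000, seed=0, alpha=0.05):
    """Percentile CI for accuracy(a) - accuracy(b) with rows resampled jointly."""
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    for _ in range(n_resamples):
        # One shared index sample per replicate keeps the comparison paired.
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        a = accuracy([preds_a[i] for i in idx], y)
        b = accuracy([preds_b[i] for i in idx], y)
        diffs.append(a - b)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    point = accuracy(preds_a, labels) - accuracy(preds_b, labels)
    return point, lo, hi


labels = [0, 1] * 20
perfect = list(labels)        # scorer A: always right
always_zero = [0] * 40        # scorer B: right on half the rows
point, lo, hi = paired_bootstrap_diff(perfect, always_zero, labels)
```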

### `eval_toolkit.protocols`

Lightweight Protocols with **zero runtime pandas dependency** — the
home for `Scorer`, `SliceAwareScorer`, `EvalSliceLike`,
`PredictionReader`, `Versioned`. Pandas types appear in annotations
only via `TYPE_CHECKING`. Re-exported from both
`eval_toolkit` (top level) and `eval_toolkit.harness` for backward
compatibility — consumer imports don't change.
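
The `TYPE_CHECKING` technique mentioned above looks roughly like this. `MiniScorer` is a hypothetical stand-in used only to show the pattern; it is not the library's `Scorer` definition, and `@runtime_checkable` is added here just so the sketch can demonstrate structural matching.

```python
from typing import TYPE_CHECKING, Protocol, runtime_checkable

if TYPE_CHECKING:
    # Seen only by type checkers; this import never runs at runtime.
    import pandas as pd


@runtime_checkable
class MiniScorer(Protocol):
    """Hypothetical Protocol: anything with predict_proba qualifies."""

    # String annotations keep `pd` from being evaluated at runtime.
    def predict_proba(self, X: "list[str] | pd.Series") -> "list[float]": ...


class ConstantScorer:
    """Satisfies the Protocol structurally, with no inheritance."""

    def predict_proba(self, X):
        return [0.5 for _ in X]


is_scorer = isinstance(ConstantScorer(), MiniScorer)
```

Because the annotation is a string and the import sits under `TYPE_CHECKING`, this module loads even in an environment without pandas.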

## 2. `RunResult` field additions

Six new fields, all optional, each defaulting to an empty container:

| Field | Type | What it carries |
|---|---|---|
| `claim_report` | `dict[str, object]` | Serialized `ClaimReport.to_dict()`. |
| `prediction_artifacts` | `list[dict[str, object]]` | List of `PredictionArtifactRef.to_dict()`. |
| `evidence_axes` | `list[dict[str, object]]` | List of `EvidenceAxis.to_dict()`. |
| `pairing_metadata` | `dict[str, object]` | `PairingMetadata.to_dict()`. |
| `aggregate_evidence` | `dict[str, object]` | `AggregateEvidence.to_dict()`. |
| `threshold_policy` | `dict[str, object]` | `ThresholdPolicyMetadata.to_dict()`. |

v0.8 construction still works — no kwargs needed:

```python
from eval_toolkit.harness import RunResult
result = RunResult(run_id="r", git_sha=None, config={}, by_slice={})
assert result.claim_report == {}
assert result.prediction_artifacts == []
```

v0.9 enriched construction — pass any subset:

```python
from eval_toolkit.evidence import EvidenceAxis
from eval_toolkit.harness import RunResult
result = RunResult(
    run_id="r",
    git_sha=None,
    config={},
    by_slice={},
    evidence_axes=[EvidenceAxis(name="slice", value="locked_eval").to_dict()],
)
assert result.evidence_axes[0]["name"] == "slice"
```
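
Why old call sites keep working can be seen in a small stand-in. The sketch below assumes the fields are declared dataclass-style with `default_factory`; `MiniRunResult` is hypothetical and only mirrors the shape described in the table, not the library's actual class definition.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MiniRunResult:
    """Hypothetical stand-in, not the library's RunResult."""
    run_id: str
    by_slice: dict
    # New optional fields default to fresh empty containers, so each
    # instance owns its own container and v0.8-style calls stay valid.
    claim_report: dict = field(default_factory=dict)
    evidence_axes: list = field(default_factory=list)


old_style = MiniRunResult(run_id="r", by_slice={})
new_style = MiniRunResult(
    run_id="r", by_slice={}, evidence_axes=[{"name": "slice"}]
)
```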

## 3. End-to-end walkthrough — extending a v0.8 harness with v0.9 evidence

This section shows a minimal v0.8 harness and how to extend it with
v0.9 claim gates without rewriting the eval loop.

### v0.8: bare metric report

A v0.8 harness scores one model on one slice and writes a result:

```python
import numpy as np
import pandas as pd
from eval_toolkit import EvalSlice, evaluate

class TinyScorer:
    """Toy scorer that returns uniform noise in [0.2, 0.8)."""
    def predict_proba(self, X):
        rng = np.random.default_rng(0)
        return rng.uniform(0.2, 0.8, size=len(X))

df = pd.DataFrame({"text": [f"r{i}" for i in range(40)], "label": [0, 1] * 20})
slices = [EvalSlice(name="locked_eval", df=df)]
result = evaluate({"tiny": TinyScorer()}, slices, run_id="v08-demo")
assert "locked_eval" in result.by_slice
```

That's a complete v0.8 run. `result.by_slice["locked_eval"]` carries
the metrics. No claim evidence is attached — consumers eyeball the
numbers.

### v0.9: same harness + claim gates

The v0.9 upgrade keeps the eval loop unchanged. After `evaluate`,
build a `ClaimSpec` listing the evidence the claim requires, run
`evaluate_claims`, and attach the report:

```python
import numpy as np
import pandas as pd
from eval_toolkit import EvalSlice, evaluate
from eval_toolkit.claims import (
    ClaimSpec, evaluate_claims,
    required_slice_gate, required_scorer_gate, minimum_slice_size_gate,
)
from eval_toolkit.harness import with_claim_report

class TinyScorer:
    def predict_proba(self, X):
        rng = np.random.default_rng(0)
        return rng.uniform(0.2, 0.8, size=len(X))

df = pd.DataFrame({"text": [f"r{i}" for i in range(40)], "label": [0, 1] * 20})
slices = [EvalSlice(name="locked_eval", df=df)]
result = evaluate({"tiny": TinyScorer()}, slices, run_id="v09-demo")

# v0.9 addition: declare the evidence the "tiny works on locked_eval"
# claim requires, then evaluate it against the result payload.
claim = ClaimSpec(
    name="tiny works on locked_eval",
    gates=(
        required_slice_gate("locked_eval"),
        required_scorer_gate("locked_eval", "tiny"),
        minimum_slice_size_gate("locked_eval", min_n=20, min_positive=10),
    ),
)
report = evaluate_claims(result.to_dict(), [claim])

# Frozen-by-value attachment — RunResult itself is immutable.
enriched = with_claim_report(result, report)

assert not report.has_failures()
assert enriched.claim_report["has_failures"] is False
assert len(enriched.claim_report["claims"]["tiny works on locked_eval"]) == 3
```

The harness loop didn't change. The only new code is the `ClaimSpec`
declaration, the `evaluate_claims` call, and the
`with_claim_report` attachment.
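
The attach-by-copy behavior can be pictured with a frozen dataclass and `dataclasses.replace`. This is a sketch under the assumption that the result type behaves like a frozen dataclass; `FrozenResult` and `mini_with_claim_report` are hypothetical and say nothing about how `with_claim_report` is implemented.

```python
from dataclasses import FrozenInstanceError, dataclass, field, replace


@dataclass(frozen=True)
class FrozenResult:
    """Hypothetical immutable result with one optional evidence field."""
    run_id: str
    claim_report: dict = field(default_factory=dict)


def mini_with_claim_report(result, report_dict):
    """Return a copy with claim_report set; the original is left untouched."""
    return replace(result, claim_report=dict(report_dict))


original = FrozenResult(run_id="v09-demo")
enriched = mini_with_claim_report(original, {"has_failures": False})

try:
    original.claim_report = {"has_failures": False}  # direct assignment is rejected
    assignment_rejected = False
except FrozenInstanceError:
    assignment_rejected = True
```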

### Failure path: what a failing gate looks like

If the claim requires a slice that wasn't produced, the report
records the failing gate without aborting:

```python
from eval_toolkit.claims import ClaimSpec, evaluate_claims, required_slice_gate
claim = ClaimSpec(name="ood claim", gates=(required_slice_gate("ood_unseen"),))
empty_result = {"by_slice": {}}
report = evaluate_claims(empty_result, [claim])
assert report.has_failures()
gate_result = report.claims["ood claim"][0]
assert gate_result.passed is False
assert "missing slice" in gate_result.message
```

This pattern lets your CI gate publication on `report.has_failures()`
without sprinkling assertions throughout the eval code.

## 4. `Scorer` Protocol consolidation (v0.9.0 + v0.9.1)

v0.9.0 introduced `eval_toolkit.protocols` as the pandas-free home
for `Scorer` and `SliceAwareScorer` but **left duplicate copies in
`eval_toolkit.harness`**. v0.9.1 removes the duplicates;
`harness.py` now imports from `protocols.py`. The top-level re-export
`from eval_toolkit import Scorer` resolves to the same class as before:

```python
from eval_toolkit import Scorer
from eval_toolkit.harness import Scorer as HarnessScorer
from eval_toolkit.protocols import Scorer as ProtocolsScorer
assert Scorer is HarnessScorer is ProtocolsScorer
assert Scorer.__module__ == "eval_toolkit.protocols"
```

The canonical `Scorer.predict_proba` signature still accepts
`list[str] | np.ndarray | pd.Series`. In `protocols.py`, pandas is
imported only under `TYPE_CHECKING`, so the module adds no runtime
pandas import and still loads where pandas is not installed.

## 5. Schema additions

`results.v1.json` and `results_full.v1.json` gained these top-level
optional fields, mirroring the `RunResult` additions in section 2:
`claim_report`, `prediction_artifacts`, `evidence_axes`,
`pairing_metadata`, `aggregate_evidence`, `threshold_policy`. The
schemas keep the `.v1` filename because every change is additive and
the schemas declare `additionalProperties: true` — v0.8 consumers
read v0.9 outputs without error.
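
The tolerance argument is easy to see from the consumer side. In this sketch an older reader picks out only the fields it knows and ignores everything else; `V08_KNOWN_FIELDS` and `read_like_v08` are hypothetical, and the choice of `run_id` and `by_slice` as the known fields is an assumption made for illustration.

```python
import json

# Hypothetical subset of fields that an older consumer relies on.
V08_KNOWN_FIELDS = ("run_id", "by_slice")


def read_like_v08(raw_json: str) -> dict:
    """Pick out only the fields an older consumer knows; ignore the rest."""
    payload = json.loads(raw_json)
    return {key: payload[key] for key in V08_KNOWN_FIELDS}


v09_output = json.dumps({
    "run_id": "v09-demo",
    "by_slice": {"locked_eval": {}},
    "claim_report": {"has_failures": False},  # new in v0.9
    "evidence_axes": [],                      # new in v0.9
})
view = read_like_v08(v09_output)
```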

A new schema `manifest.v1.json` lands for validating
`RunManifest.to_dict()` payloads. See
[`methodology/versioning.md` § Schema evolution policy](../methodology/versioning.md#schema-evolution)
for the policy that governs when a filename gets bumped to `.v2`.

## 6. The `validation` extra

`jsonschema>=4.21` moved into a focused `[validation]` extra in
v0.9. Install only if you use `eval_toolkit.artifacts.validate_payload`:

```bash
pip install "eval-toolkit[validation]"
```

`validate_payload` lazy-imports jsonschema; if it's missing, it
raises `ImportError` with a clear pointer to the extra. The `[all]`
and `[dev]` extras include it transitively, so users of those extras
don't need to install it separately.
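
The lazy-import pattern described here can be sketched generically. `require_optional` is a hypothetical helper, not the code inside `validate_payload`; it only shows how an optional dependency can be imported on demand and reported with an actionable message when absent.

```python
import importlib


def require_optional(module_name: str, extra: str):
    """Import a module on demand, pointing at the extra if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is required for this feature; "
            f'install it with: pip install "eval-toolkit[{extra}]"'
        ) from exc
```

Deferring the import to call time means users who never validate payloads never need the dependency installed.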

## 7. Pitfalls / common mistakes

- **Forgetting `with_claim_report`**: `RunResult` is frozen.
  Mutating `result.claim_report` directly is impossible; you must
  use `with_claim_report(result, report)` to get a new `RunResult`
  with the evidence attached.
- **Assuming `claim_report` is auto-populated**: `evaluate` does
  not evaluate claims for you. You must explicitly call
  `evaluate_claims(...)` and attach the result. The opt-in is
  deliberate — claims are caller-defined and the harness shouldn't
  guess them.
- **Confusing `evaluate_claims` argument order**: the signature is
  `evaluate_claims(result, claim_specs, *, manifest=None)` — result
  first, specs second. Not `(specs, result)`.
- **Treating `additionalProperties: true` as "unstable"**: it's the
  forward-compat *contract*. v0.8 consumers tolerate v0.9's new
  fields; v0.9 consumers tolerate hypothetical future additions
  under the same `.v1.json` filename. Filename bumps signal real
  breakage; field additions don't.
- **Calling `validate_payload` without the extra**: it raises
  `ImportError` with installation guidance. The lazy import is
  intentional because core scientific use cases don't need
  `jsonschema`.

## See also

- [`CHANGELOG.md`](https://github.com/brandon-behring/eval-toolkit/blob/main/CHANGELOG.md) `[0.9.0]` and `[0.9.1]`
  blocks for the per-version release notes.
- [`methodology/evidence.md`](../methodology/evidence.md) for the
  source-role / claim-gate methodology these primitives encode.
- [`methodology/versioning.md`](../methodology/versioning.md)
  § Schema evolution policy — when schema filenames bump vs. stay.
- [`migration/v0.8.md`](v0.8.md) for the v0.7 → v0.8 step (read
  first if you're upgrading from v0.7.x).
- [`migration/v0.7.md`](v0.7.md) for the v0.6 → v0.7 step.