# Prediction Artifacts and Metric States

This chapter covers `eval_toolkit.artifacts` — the layer that lets you
**separate predictions from the metrics computed on them**. The toolkit
exposes three contracts: `PredictionArtifactRef` (manifest pointer to
on-disk predictions), `PredictionColumns` (column-mapping schema), and
`MetricState` (typed status wrapper for metrics that may be skipped or
errored). Together they answer a single question: *can a downstream
consumer replay this evaluation without re-running inference?*

> **Background.** This chapter assumes you've produced a `RunResult`
> (see [getting-started](../getting-started.md)) and have a CSV,
> JSONL, or Parquet file of per-row predictions. The
> [methodology curriculum](README.md) covers the metric kernels
> ([API: metrics](../api/metrics.md)), bootstrap CI shapes
> ([bootstrap.md](bootstrap.md)), and the broader claim model
> ([claims.md](claims.md)). What's *not* covered elsewhere is the
> data contract that ties them together: how the bytes of a
> `predictions.csv` connect to a typed reference in
> `manifest.json` and the optional schema-validated payload that
> ships next to it.

(why)=
## Why a separate artifact layer?
Computing metrics from predictions in-memory is cheap. The problem
appears later, when a stakeholder asks:

- "What were the actual scores for the rows that landed in the FPR-low
  region of the curve?"
- "Did the v0.8 release use the same eval predictions as the v0.9
  release, or did the scorer drift?"
- "Can we re-compute PR-AUC under a new bootstrap method without
  re-running inference on 50K rows?"

The fix is to **persist predictions as a first-class artifact** with a
stable on-disk contract. That artifact then carries enough metadata
(`uri`, `media_type`, `columns`, `sha256`, `n_rows`) for any future
analysis pipeline to load, validate, and recompute against it.
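
As a rough illustration of where the integrity fields could come from, the sketch below derives `sha256` and `n_rows` for a CSV file using only the standard library. `describe_csv` is a hypothetical helper, not part of `eval_toolkit`, and it assumes `n_rows` counts data rows (the header is excluded).

```python
import csv
import hashlib
import tempfile
from pathlib import Path


def describe_csv(path: Path) -> dict:
    """Return integrity metadata for a CSV file (hypothetical helper)."""
    # Hash the exact bytes on disk so any later mutation changes the digest.
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    # Count data rows with the csv module so quoted newlines are handled.
    with path.open(newline="") as fh:
        n_rows = sum(1 for _ in csv.DictReader(fh))
    return {"sha256": digest, "n_rows": n_rows}


# Usage on a throwaway file with a header and two data rows.
demo_path = Path(tempfile.mkdtemp()) / "predictions.csv"
demo_path.write_text("label,score\n1,0.9\n0,0.1\n", encoding="utf-8")
meta = describe_csv(demo_path)
```

A consumer can recompute the same two values after download and compare them with the manifest before trusting the file.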

(artifacts-data-model)=
## The data model
Three frozen dataclasses make up the public contract:

(prediction-columns)=
### PredictionColumns
The column-mapping schema. Required: `label` and `score`. Optional but
strongly recommended for paired-diff workflows: `row_id` and
`content_hash`.

```python
from eval_toolkit.artifacts import PredictionColumns

columns = PredictionColumns(
    label="y_true",
    score="prob_positive",
    row_id="example_id",
    content_hash="text_sha256",
)
assert columns.label == "y_true"
```

`row_id` lets downstream code align predictions across two runs over
the same rows (for example, a baseline and a candidate scorer) for
paired bootstrap diffs. `content_hash` detects silent label drift or
text mutation: the same `row_id` with a different `content_hash` means
the row is *not* the same row.
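
The alignment rule can be sketched in a few lines. The function below is a hypothetical illustration, not the toolkit's implementation; it assumes each run has already been reduced to a `row_id -> content_hash` mapping.

```python
def find_mutated_rows(run_a: dict[str, str], run_b: dict[str, str]) -> list[str]:
    """Return row_ids present in both runs whose content_hash differs."""
    shared = run_a.keys() & run_b.keys()
    return sorted(rid for rid in shared if run_a[rid] != run_b[rid])


# r1 keeps its id but its content changed; r2 and r3 are simply unshared.
run_a = {"r0": "h0", "r1": "h1", "r2": "h2"}
run_b = {"r0": "h0", "r1": "h1-edited", "r3": "h3"}
mutated = find_mutated_rows(run_a, run_b)
```

Here only `r1` is flagged: it appears in both runs under the same id, but its hash shows the underlying content is different.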

(artifact-ref)=
### PredictionArtifactRef
The manifest-level pointer to a persisted prediction file. Carries the
URI, MIME-style media type, the column mapping, plus integrity
metadata (`sha256`, `n_rows`).

```python
from eval_toolkit.artifacts import PredictionArtifactRef, PredictionColumns

ref = PredictionArtifactRef(
    uri="s3://my-eval-bucket/run-42/predictions.csv",
    media_type="text/csv",
    columns=PredictionColumns(label="y", score="s"),
    sha256="a" * 64,
    n_rows=10_000,
    role="locked_eval_predictions",
)
serialized = ref.to_dict()
assert serialized["uri"].startswith("s3://")
```

The `role` field is free-form — useful when a manifest carries
predictions for multiple slices or scorers and the renderer needs to
distinguish them.
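
A renderer might select references by `role` roughly as follows. This is a sketch under assumptions: the plain dicts stand in for serialized references, the presence of a `"role"` key in the serialized form is assumed, and `refs_with_role` is a hypothetical helper.

```python
def refs_with_role(entries: list[dict], role: str) -> list[dict]:
    """Return the serialized refs whose role matches exactly."""
    return [entry for entry in entries if entry.get("role") == role]


# Stand-ins for serialized references; only "uri" and "role" matter here.
manifest_predictions = [
    {"uri": "run-42/dev_predictions.csv", "role": "dev_predictions"},
    {"uri": "run-42/locked_predictions.csv", "role": "locked_eval_predictions"},
]
locked = refs_with_role(manifest_predictions, "locked_eval_predictions")
```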

(metric-state)=
### MetricState
A typed wrapper for metrics that may not be computable: too few rows,
a single-class slice, NaN inputs, or an exception in a downstream tool.
The `status` field is `"ok"`, `"skipped"`, or `"error"`.

```python
from eval_toolkit.artifacts import MetricState, error_metric, skipped_metric

# Successful metric:
ok = MetricState(value=0.82, status="ok").to_dict()
assert ok["value"] == 0.82

# A slice with too few rows to bootstrap:
skipped = skipped_metric("n_rows below bootstrap floor", n=5)
assert skipped["status"] == "skipped"

# A metric whose computation hit a divide-by-zero:
errored = error_metric("ZeroDivisionError", numerator=0, denominator=0)
assert errored["status"] == "error"
```

Consumers reading a metric should pattern-match on `status` before
trusting `value`. The point is to make the *absence* of a number an
explicit, JSON-typed state instead of a `null` that could mean a
dozen different things.
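
On the consumer side, checking `status` first might look like the sketch below. `render_metric` is a hypothetical renderer, and the `"reason"` key is an assumption about the serialized payload; only `"status"` and `"value"` are taken from the examples above.

```python
def render_metric(state: dict) -> str:
    """Format a serialized metric state for display (hypothetical renderer)."""
    status = state.get("status")
    if status == "ok":
        return f"{state['value']:.3f}"
    if status == "skipped":
        return f"n/a ({state.get('reason', 'no reason given')})"
    if status == "error":
        return f"ERROR ({state.get('reason', 'no reason given')})"
    # An unknown status is a contract violation, not a missing number.
    raise ValueError(f"unknown metric status: {status!r}")


ok_cell = render_metric({"status": "ok", "value": 0.82})
skipped_cell = render_metric({"status": "skipped", "reason": "single-class slice"})
```

Raising on an unrecognized status keeps a future fourth state from being rendered as if it were a number.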

(artifacts-worked-walkthrough)=
## Worked walkthrough
This walkthrough produces predictions for two scorers, registers each
file as an artifact, computes a bootstrap CI from one artifact, and
computes a paired diff against the second, all without re-running
inference.

```python
import csv
import tempfile
from pathlib import Path

import numpy as np

from eval_toolkit.analysis import (
    bootstrap_metric_from_predictions,
    load_prediction_arrays,
    paired_diff_from_prediction_refs,
)
from eval_toolkit.artifacts import PredictionArtifactRef, PredictionColumns

tmpdir = Path(tempfile.mkdtemp())

# Synthetic predictions for two scorers on the same dev rows.
rng = np.random.default_rng(0)
n = 200
labels = rng.integers(0, 2, size=n)
baseline_scores = np.clip(0.3 + 0.4 * labels + rng.normal(0, 0.15, n), 0, 1)
candidate_scores = np.clip(0.2 + 0.6 * labels + rng.normal(0, 0.10, n), 0, 1)


def _write_csv(path: Path, scores: np.ndarray) -> None:
    with path.open("w", newline="") as fh:
        writer = csv.DictWriter(
            fh, fieldnames=["label", "score", "row_id", "content_hash"]
        )
        writer.writeheader()
        for i, (y, s) in enumerate(zip(labels, scores, strict=True)):
            writer.writerow(
                {"label": int(y), "score": float(s), "row_id": f"r{i}", "content_hash": f"h{i}"}
            )


baseline_path = tmpdir / "baseline_preds.csv"
candidate_path = tmpdir / "candidate_preds.csv"
_write_csv(baseline_path, baseline_scores)
_write_csv(candidate_path, candidate_scores)

columns = PredictionColumns(
    label="label", score="score", row_id="row_id", content_hash="content_hash"
)
baseline_ref = PredictionArtifactRef(
    uri=str(baseline_path),
    media_type="text/csv",
    columns=columns,
)
candidate_ref = PredictionArtifactRef(
    uri=str(candidate_path),
    media_type="text/csv",
    columns=columns,
)

# Re-load the predictions later (could be a separate process):
arrays = load_prediction_arrays(baseline_ref.to_dict())
assert arrays.labels.shape == (n,)
assert arrays.scores.shape == (n,)

# Bootstrap PR-AUC CI directly from the on-disk artifact:
ci = bootstrap_metric_from_predictions(baseline_ref.to_dict(), n_resamples=50, seed=1)
assert ci["n_resamples"] == 50
low, high = ci["ci_95"]
assert low <= high

# Paired diff: candidate − baseline (rows align by row_id + content_hash):
diff = paired_diff_from_prediction_refs(
    baseline_ref.to_dict(),
    candidate_ref.to_dict(),
    n_resamples=50,
    seed=1,
)
assert diff["n_resamples"] == 50
```

The diff helper requires the `row_id` values, and the `content_hash`
values when that column is mapped, to match between the two artifacts.
Mismatches raise `ValueError` rather than silently producing a
meaningless diff.
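
To make "paired" concrete, here is a minimal standard-library sketch of a paired percentile bootstrap. It is not the toolkit's implementation: it swaps in accuracy at a 0.5 threshold for PR-AUC to stay short, uses `random` instead of numpy, and the function names are hypothetical.

```python
import random


def accuracy(labels, scores, threshold=0.5):
    """Fraction of rows where the thresholded score matches the label."""
    hits = sum((s >= threshold) == bool(y) for y, s in zip(labels, scores))
    return hits / len(labels)


def paired_bootstrap_diff(labels, baseline, candidate, n_resamples=200, seed=0):
    """Percentile CI for accuracy(candidate) - accuracy(baseline)."""
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    for _ in range(n_resamples):
        # One index draw per resample, reused for both scorers: sharing
        # the resampled rows is what makes the bootstrap paired.
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        diffs.append(
            accuracy(y, [candidate[i] for i in idx])
            - accuracy(y, [baseline[i] for i in idx])
        )
    diffs.sort()
    low = diffs[int(0.025 * (n_resamples - 1))]
    high = diffs[int(0.975 * (n_resamples - 1))]
    return {"n_resamples": n_resamples, "ci_95": (low, high)}


labels = [0, 1] * 20
baseline = [0.6, 0.4] * 10 + [0.2, 0.8] * 10  # wrong on the first 20 rows
candidate = [0.1, 0.9] * 20  # right on every row
result = paired_bootstrap_diff(labels, baseline, candidate, n_resamples=100, seed=1)
```

Because both scorers see the same resampled rows, row-level difficulty cancels out of the difference. That cancellation only holds if the rows really are the same, which is why misaligned artifacts must be rejected.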

(metric-state-vs-error)=
## MetricState vs error_metric helpers
Three flavors of metric reporting, in order of escalation:

1. **`MetricState(value=x, status="ok")`** — the metric was computed.
   Use this directly when you control the metric pipeline.

2. **`skipped_metric(reason, **details)`** — the metric was *not*
   computed for a stated reason. Examples: a single-class slice (where
   PR-AUC is undefined), too few rows for bootstrap, or missing
   positives at a target FPR. The metric was *expected* to be
   unavailable.

3. **`error_metric(reason, **details)`** — the metric raised an
   unexpected exception. The metric was *expected* to be computable
   but wasn't. The renderer should highlight these as bugs to
   investigate.

`skipped` is the normal "principled absence." `error` is the
"something is wrong here" state. Don't conflate them — a renderer that
treats all non-`ok` states identically loses important signal.
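
On the producer side, the distinction could be drawn along these lines. The sketch builds plain dicts that mimic the three states so it runs without the toolkit; in real code the `skipped_metric` and `error_metric` helpers would take their place. `compute_metric_state` and both toy metrics are hypothetical, and the `"reason"` key is an assumed payload field.

```python
def compute_metric_state(metric_fn, labels, scores) -> dict:
    """Run metric_fn and report the outcome as an ok/skipped/error dict."""
    # Principled absence: a single-class slice is a known, expected limit.
    if len(set(labels)) < 2:
        return {"status": "skipped", "reason": "single-class slice", "n": len(labels)}
    try:
        value = metric_fn(labels, scores)
    except Exception as exc:
        # Unexpected failure: surface it as a bug to investigate.
        return {"status": "error", "reason": type(exc).__name__}
    return {"status": "ok", "value": value}


def mean_score_gap(labels, scores):
    """Toy metric: mean positive score minus mean negative score."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    return sum(pos) / len(pos) - sum(neg) / len(neg)


def broken_metric(labels, scores):
    """Stand-in for a metric with an unexpected bug."""
    return 1 / 0


ok_state = compute_metric_state(mean_score_gap, [0, 1, 1], [0.2, 0.7, 0.9])
skipped_state = compute_metric_state(mean_score_gap, [1, 1, 1], [0.2, 0.7, 0.9])
error_state = compute_metric_state(broken_metric, [0, 1], [0.2, 0.7])
```

The known precondition is checked before the metric runs, so the `except` branch is reserved for failures nobody anticipated.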

(validation)=
## The `validation` extra and validate_payload
`validate_payload` runs a payload against a bundled JSON Schema using
the optional `jsonschema` dependency. The dependency is opt-in so the
toolkit's core deps stay at numpy / scipy / sklearn.

```python
# Install: pip install "eval-toolkit[validation]"
from eval_toolkit.artifacts import validate_payload

result_payload = {
    "run_id": "demo",
    "schema_version": "v1",
    "config": {},
    "by_slice": {
        "dev": {"n": 100, "n_positive": 50, "by_scorer": {"model": {"pr_auc": 0.8}}}
    },
}
validate_payload(result_payload, schema_name="results.v1.json")  # raises on bad shape
```

The schemas live at `eval_toolkit/schemas/*.json` and follow the
additive-fields policy (`additionalProperties: true`) documented in
[versioning.md § schema-evolution](versioning.md#schema-evolution): old
consumers see new fields as inert; new consumers see old payloads as
valid.
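
What "inert" means for a consumer can be shown with a tolerant reader that only touches the fields it knows. This is an illustrative sketch: `summarize_slice` is hypothetical, and `n_abstained` is an invented field standing in for any later addition.

```python
def summarize_slice(slice_payload: dict) -> dict:
    """Read only the fields this consumer knows; ignore everything else."""
    return {"n": slice_payload["n"], "n_positive": slice_payload["n_positive"]}


old_payload = {"n": 100, "n_positive": 50}
# A later producer adds a field (hypothetical); this consumer is unaffected.
new_payload = {"n": 100, "n_positive": 50, "n_abstained": 3}

old_summary = summarize_slice(old_payload)
new_summary = summarize_slice(new_payload)
```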

(artifacts-pitfalls)=
## Pitfalls / Common mistakes
**Do include `content_hash` when running paired diffs.** Without it,
`paired_diff_from_prediction_refs` can only check `row_id` equality —
which means a row whose underlying text changed silently between runs
will compare as the same row. The diff's variance estimate is then
meaningless. The fix: hash the input text and persist that hash with
the prediction.

**Don't build artifact dicts inline.** Construct `PredictionColumns`
and `PredictionArtifactRef` instances and serialize via `.to_dict()`.
Hand-written `{"uri": ..., "media_type": ...}` dicts skip the
`__post_init__` validation that catches typos (`"row_ids"` vs
`"row_id"`) and bad shapes (`n_rows` as a `bool`).
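
The kind of check a hand-written dict bypasses can be sketched with a small frozen dataclass. `ArtifactRefSketch` is a hypothetical stand-in, not the toolkit's `PredictionArtifactRef`, and its validation rules are assumptions chosen to mirror the two failure modes named above.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class ArtifactRefSketch:
    """Hypothetical stand-in, not the toolkit's PredictionArtifactRef."""

    uri: str
    media_type: str
    n_rows: Optional[int] = None

    def __post_init__(self) -> None:
        if self.n_rows is not None:
            # bool is a subclass of int, so it must be rejected explicitly.
            if isinstance(self.n_rows, bool) or not isinstance(self.n_rows, int):
                raise TypeError("n_rows must be an int")
            if self.n_rows < 0:
                raise ValueError("n_rows must be non-negative")


good = asdict(ArtifactRefSketch(uri="preds.csv", media_type="text/csv", n_rows=10))

try:
    ArtifactRefSketch(uri="preds.csv", media_type="text/csv", n_rows=True)
    rejected_bool = False
except TypeError:
    rejected_bool = True

try:
    # Misspelled field name: fails at construction instead of being ignored.
    ArtifactRefSketch(uri="preds.csv", media_type="text/csv", n_row=10)
    rejected_typo = False
except TypeError:
    rejected_typo = True
```

A plain dict would accept both `{"n_rows": True}` and `{"n_row": 10}` without complaint, and the mistake would surface only when something downstream reads the manifest.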

**Don't treat `media_type` as decorative.** It selects the prediction
reader in `load_prediction_arrays`. The two media types the toolkit
recognizes out of the box are `text/csv` and `application/jsonl`; any
other type raises a `ValueError` with a clear "no built-in reader"
message. Setting `media_type` correctly is the difference between a
working pipeline and a debugging session.
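
Reader selection by media type is commonly done with a dispatch table. The sketch below shows the general pattern only; `READERS`, `read_rows`, and the exact error wording are hypothetical, and the toolkit's internals may differ.

```python
import csv
import io
import json


def _read_csv(text: str) -> list[dict]:
    return list(csv.DictReader(io.StringIO(text)))


def _read_jsonl(text: str) -> list[dict]:
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# media_type -> reader; anything not listed has no built-in reader.
READERS = {"text/csv": _read_csv, "application/jsonl": _read_jsonl}


def read_rows(text: str, media_type: str) -> list[dict]:
    try:
        reader = READERS[media_type]
    except KeyError:
        raise ValueError(f"no built-in reader for media type {media_type!r}") from None
    return reader(text)


csv_rows = read_rows("label,score\n1,0.9\n", "text/csv")
jsonl_rows = read_rows('{"label": 1, "score": 0.9}\n', "application/jsonl")

try:
    read_rows("", "application/x-unknown")
except ValueError as exc:
    unsupported_message = str(exc)
```

Note that the CSV reader yields strings while the JSONL reader yields typed values, so a wrong `media_type` on a file that happens to parse can still change what the metric code receives.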

**Don't silently swallow non-finite metrics.** `sanitize_for_json`
converts NaN / +Inf / -Inf to a `skipped_metric(...)` payload so
`json.dumps(allow_nan=False)` succeeds. The downstream cost: a renderer
that doesn't look at `status` thinks the metric is missing. Always
pattern-match on `status` first.
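
The conversion can be approximated as follows. This is a simplified sketch, not the toolkit's `sanitize_for_json`: the function name `sanitize` and the replacement payload (including its `"reason"` text) are assumptions.

```python
import json
import math


def sanitize(obj):
    """Recursively replace non-finite floats with a skipped-style payload."""
    if isinstance(obj, float) and not math.isfinite(obj):
        return {"status": "skipped", "reason": "non-finite value"}
    if isinstance(obj, dict):
        return {key: sanitize(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [sanitize(value) for value in obj]
    return obj


payload = {"pr_auc": float("nan"), "n": 100, "ci_95": [0.1, float("inf")]}
# Without sanitize(), allow_nan=False would raise ValueError here.
encoded = json.dumps(sanitize(payload), allow_nan=False)
decoded = json.loads(encoded)
```

After the round trip, `decoded["pr_auc"]` is a dict rather than a number, which is exactly why a consumer has to inspect `status` before reading `value`.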

**Don't rely on the `validation` extra in production code paths.**
`validate_payload` raises `ImportError` if `jsonschema` is not
installed. Catch that explicitly in any consumer code that wants to
degrade gracefully when running in environments where the extra isn't
present (e.g., a lightweight inference-only image).
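
One way to degrade gracefully is a thin wrapper that reports whether validation actually ran. `validate_if_available` is a hypothetical helper; it relies only on the `validate_payload(payload, schema_name=...)` call shown earlier. In this sketch the import sits inside the `try`, so a missing toolkit is treated the same as a missing `jsonschema`.

```python
def validate_if_available(payload: dict, schema_name: str) -> bool:
    """Validate when possible; return False when validation is unavailable."""
    try:
        from eval_toolkit.artifacts import validate_payload

        validate_payload(payload, schema_name=schema_name)
    except ImportError:
        # Validation could not run. Schema violations are NOT caught here
        # and still propagate to the caller.
        return False
    return True


result_payload = {
    "run_id": "demo",
    "schema_version": "v1",
    "config": {},
    "by_slice": {
        "dev": {"n": 100, "n_positive": 50, "by_scorer": {"model": {"pr_auc": 0.8}}}
    },
}
validated = validate_if_available(result_payload, "results.v1.json")
```

Returning a boolean lets the caller log that the payload went out unvalidated, rather than silently treating "could not check" as "checked and fine".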

## See also

- [claims.md](claims.md) — the gate layer that reads from these
  artifacts to produce go/no-go verdicts.
- [evidence.md](evidence.md) — broader source-role and claim-mode
  framing.
- [versioning.md § schema-evolution](versioning.md#schema-evolution) —
  the additive-fields contract for the bundled `.v1.json` schemas.
- `eval_toolkit.analysis` module docstring — implementation details
  for the CSV / JSONL prediction readers and the
  `bootstrap_metric_from_predictions` / `paired_diff_from_prediction_refs`
  helpers.