Prediction Artifacts and Metric States#

This chapter covers eval_toolkit.artifacts — the layer that lets you separate predictions from the metrics computed on them. The toolkit exposes three contracts: PredictionArtifactRef (manifest pointer to on-disk predictions), PredictionColumns (column-mapping schema), and MetricState (typed status wrapper for metrics that may be skipped or errored). Together they answer a single question: can a downstream consumer replay this evaluation without re-running inference?

Background. This chapter assumes you’ve produced a RunResult (see getting-started) and have a CSV, JSONL, or Parquet file of per-row predictions. The methodology curriculum covers the metric kernels (API: metrics), bootstrap CI shapes (bootstrap.md), and the broader claim model (claims.md). What’s not covered elsewhere is the data contract that ties them together: how the bytes of a predictions.csv connect to a typed reference in manifest.json and the optional schema-validated payload that ships next to it.

Why a separate artifact layer?#

Computing metrics from predictions in-memory is cheap. The problem appears later, when a stakeholder asks:

“What were the actual scores for the rows that landed in the FPR-low region of the curve?”
“Did the v0.8 release use the same eval predictions as the v0.9 release, or did the scorer drift?”
“Can we re-compute PR-AUC under a new bootstrap method without re-running inference on 50K rows?”

The fix is to persist predictions as a first-class artifact with a stable on-disk contract. That artifact then carries enough metadata (uri, media_type, columns, sha256, n_rows) for any future analysis pipeline to load, validate, and recompute against it.
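As an illustration of where the integrity metadata could come from, the sketch below derives sha256 and n_rows from a CSV on disk using only the standard library. The helper name describe_predictions_csv is hypothetical, and counting data rows without the header is an assumption; the toolkit may compute these fields differently.

```python
import csv
import hashlib
import tempfile
from pathlib import Path


def describe_predictions_csv(path: Path) -> dict:
    """Hypothetical helper: sha256 of the file bytes plus the data-row count."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    with path.open(newline="") as fh:
        # DictReader consumes the header, so only data rows are counted.
        n_rows = sum(1 for _ in csv.DictReader(fh))
    return {"sha256": digest, "n_rows": n_rows}


# Tiny demo file with a header and two prediction rows.
demo_bytes = b"label,score\n1,0.9\n0,0.2\n"
demo_path = Path(tempfile.mkdtemp()) / "predictions.csv"
demo_path.write_bytes(demo_bytes)

meta = describe_predictions_csv(demo_path)
```

Hashing the raw bytes rather than the parsed rows means any change to the file, including reordering or reformatting, produces a different digest.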

The data model#

Three frozen dataclasses make up the public contract:

PredictionColumns#

The column-mapping schema. Required: label and score. Optional but strongly recommended for paired-diff workflows: row_id and content_hash.

from eval_toolkit.artifacts import PredictionColumns

columns = PredictionColumns(
    label="y_true",
    score="prob_positive",
    row_id="example_id",
    content_hash="text_sha256",
)
assert columns.label == "y_true"

row_id lets downstream code align predictions across two runs of the same scorer for paired bootstrap diffs. content_hash detects silent label drift or text mutation: if two runs share a row_id but disagree on content_hash, they did not score the same input, so the rows must not be paired.
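One way to produce a content_hash is to hash the input text itself. The sketch below assumes SHA-256 over the UTF-8 bytes; the function name text_sha256 and the example rows are illustrative, not part of the toolkit.

```python
import hashlib


def text_sha256(text: str) -> str:
    """Hypothetical helper: hex SHA-256 of the UTF-8 encoded input text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# The same row_id scored in two runs, but the text was edited in between.
run_a = {"row_id": "r7", "content_hash": text_sha256("refund my order")}
run_b = {"row_id": "r7", "content_hash": text_sha256("refund my order!")}

same_id = run_a["row_id"] == run_b["row_id"]
same_content = run_a["content_hash"] == run_b["content_hash"]

# Matching id with differing content means the row silently changed.
row_drifted = same_id and not same_content
```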

PredictionArtifactRef#

The manifest-level pointer to a persisted prediction file. Carries the URI, MIME-style media type, the column mapping, plus integrity metadata (sha256, n_rows).

from eval_toolkit.artifacts import PredictionArtifactRef, PredictionColumns

ref = PredictionArtifactRef(
    uri="s3://my-eval-bucket/run-42/predictions.csv",
    media_type="text/csv",
    columns=PredictionColumns(label="y", score="s"),
    sha256="a" * 64,
    n_rows=10_000,
    role="locked_eval_predictions",
)
serialized = ref.to_dict()
assert serialized["uri"].startswith("s3://")

The role field is free-form — useful when a manifest carries predictions for multiple slices or scorers and the renderer needs to distinguish them.

MetricState#

A typed wrapper for metrics that may not be computable: too few rows, a single-class slice, NaN inputs, or an exception in a downstream tool. The status field is "ok", "skipped", or "error".

from eval_toolkit.artifacts import MetricState, error_metric, skipped_metric

# Successful metric:
ok = MetricState(value=0.82, status="ok").to_dict()
assert ok["value"] == 0.82

# A slice that's too small to bootstrap:
skipped = skipped_metric("n_resamples below floor", n=5)
assert skipped["status"] == "skipped"

# A metric whose computation hit a divide-by-zero:
errored = error_metric("ZeroDivisionError", numerator=0, denominator=0)
assert errored["status"] == "error"

Consumers reading a metric should pattern-match on status before trusting value. The point is to make the absence of a number an explicit, JSON-typed state instead of a null that could mean a dozen different things.
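A consumer-side sketch of that rule, written against plain dicts shaped like the serialized states above. Only the status and value keys are taken from the examples in this chapter; the function name format_metric and the display strings are assumptions.

```python
def format_metric(payload: dict) -> str:
    """Hypothetical renderer: decide on status before touching value."""
    status = payload.get("status")
    if status == "ok":
        return f"{payload['value']:.3f}"
    if status == "skipped":
        return "n/a (skipped)"
    if status == "error":
        return "ERROR"
    # An unknown status is itself a contract violation worth surfacing.
    raise ValueError(f"unknown metric status: {status!r}")


payloads = [
    {"status": "ok", "value": 0.82},
    {"status": "skipped"},
    {"status": "error"},
]
rendered = [format_metric(p) for p in payloads]
```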

Worked walkthrough#

This walkthrough produces predictions, registers them as artifacts, computes a bootstrap CI from one artifact, and computes a paired diff against a second artifact, all without re-running inference.

import csv
import numpy as np
import tempfile
from pathlib import Path

from eval_toolkit.analysis import (
    bootstrap_metric_from_predictions,
    load_prediction_arrays,
    paired_diff_from_prediction_refs,
)
from eval_toolkit.artifacts import PredictionArtifactRef, PredictionColumns

tmpdir = Path(tempfile.mkdtemp())

# Synthetic predictions for two scorers on the same dev rows.
rng = np.random.default_rng(0)
n = 200
labels = rng.integers(0, 2, size=n)
baseline_scores = np.clip(0.3 + 0.4 * labels + rng.normal(0, 0.15, n), 0, 1)
candidate_scores = np.clip(0.2 + 0.6 * labels + rng.normal(0, 0.10, n), 0, 1)


def _write_csv(path: Path, scores: np.ndarray) -> None:
    with path.open("w", newline="") as fh:
        writer = csv.DictWriter(
            fh, fieldnames=["label", "score", "row_id", "content_hash"]
        )
        writer.writeheader()
        for i, (y, s) in enumerate(zip(labels, scores, strict=True)):
            writer.writerow(
                {"label": int(y), "score": float(s), "row_id": f"r{i}", "content_hash": f"h{i}"}
            )


baseline_path = tmpdir / "baseline_preds.csv"
candidate_path = tmpdir / "candidate_preds.csv"
_write_csv(baseline_path, baseline_scores)
_write_csv(candidate_path, candidate_scores)

columns = PredictionColumns(
    label="label", score="score", row_id="row_id", content_hash="content_hash"
)
baseline_ref = PredictionArtifactRef(
    uri=str(baseline_path),
    media_type="text/csv",
    columns=columns,
)
candidate_ref = PredictionArtifactRef(
    uri=str(candidate_path),
    media_type="text/csv",
    columns=columns,
)

# Re-load the predictions later (could be a separate process):
arrays = load_prediction_arrays(baseline_ref.to_dict())
assert arrays.labels.shape == (n,)
assert arrays.scores.shape == (n,)

# Bootstrap PR-AUC CI directly from the on-disk artifact:
ci = bootstrap_metric_from_predictions(baseline_ref.to_dict(), n_resamples=50, seed=1)
assert ci["n_resamples"] == 50
low, high = ci["ci_95"]
assert low <= high

# Paired diff: candidate − baseline (rows align by row_id + content_hash):
diff = paired_diff_from_prediction_refs(
    baseline_ref.to_dict(),
    candidate_ref.to_dict(),
    n_resamples=50,
    seed=1,
)
assert diff["n_resamples"] == 50

The diff helper enforces matching row_id and content_hash between the two artifacts. Mismatches raise ValueError rather than silently producing a meaningless diff.
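The kind of check involved can be sketched in a few lines. This is a stand-in, not the toolkit's implementation: it assumes rows are compared positionally and that each row is a dict with row_id and content_hash keys, and the function name check_alignment is hypothetical.

```python
def check_alignment(baseline_rows: list, candidate_rows: list) -> None:
    """Hypothetical check: raise ValueError unless the rows pair up exactly."""
    if len(baseline_rows) != len(candidate_rows):
        raise ValueError(
            f"row count mismatch: {len(baseline_rows)} vs {len(candidate_rows)}"
        )
    for i, (b, c) in enumerate(zip(baseline_rows, candidate_rows)):
        if b["row_id"] != c["row_id"]:
            raise ValueError(f"row {i}: row_id mismatch")
        if b["content_hash"] != c["content_hash"]:
            raise ValueError(f"row {i}: content_hash mismatch for {b['row_id']}")


baseline = [
    {"row_id": "r0", "content_hash": "h0"},
    {"row_id": "r1", "content_hash": "h1"},
]
candidate_drifted = [
    {"row_id": "r0", "content_hash": "h0"},
    {"row_id": "r1", "content_hash": "CHANGED"},
]

check_alignment(baseline, list(baseline))  # identical rows pass silently

try:
    check_alignment(baseline, candidate_drifted)
    mismatch_message = None
except ValueError as exc:
    mismatch_message = str(exc)
```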

MetricState vs error_metric helpers#

Three flavors of metric reporting, in order of escalation:

MetricState(value=x, status="ok") — the metric was computed. Use this directly when you control the metric pipeline.
skipped_metric(reason, **details) — the metric was not computed for a stated reason. Examples: single-class slice (PR-AUC undefined), too few rows for bootstrap, missing positives at a target FPR. The metric was expected to be unavailable.
error_metric(reason, **details) — the metric raised an unexpected exception. The metric was expected to be computable but wasn’t. The renderer should highlight these as bugs to investigate.

skipped is the normal “principled absence.” error is the “something is wrong here” state. Don’t conflate them — a renderer that treats all non-ok states identically loses important signal.
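On the producer side, the distinction could look like the following sketch. It uses plain dicts as stand-ins for the toolkit's payloads; safe_metric, mean_gap, buggy_metric, and the reason key are all illustrative assumptions rather than toolkit API.

```python
from typing import Callable, Sequence


def safe_metric(
    fn: Callable[[Sequence[int], Sequence[float]], float],
    labels: Sequence[int],
    scores: Sequence[float],
) -> dict:
    """Hypothetical wrapper that keeps expected and unexpected absence apart."""
    # Expected absence: the metric is undefined on a single-class slice.
    if len(set(labels)) < 2:
        return {"status": "skipped", "reason": "single-class slice", "n": len(labels)}
    try:
        value = fn(labels, scores)
    except Exception as exc:
        # Unexpected failure: record it as a bug to investigate.
        return {"status": "error", "reason": type(exc).__name__}
    return {"status": "ok", "value": value}


def mean_gap(labels, scores):
    """Toy metric: mean positive score minus mean negative score."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    return sum(pos) / len(pos) - sum(neg) / len(neg)


def buggy_metric(labels, scores):
    return sum(scores) / 0  # deliberate bug for the demo


ok_state = safe_metric(mean_gap, [1, 0, 1, 0], [0.9, 0.2, 0.7, 0.4])
skipped_state = safe_metric(mean_gap, [1, 1, 1], [0.9, 0.8, 0.7])
error_state = safe_metric(buggy_metric, [1, 0], [0.9, 0.1])
```

The precondition check runs before the metric is called, so a single-class slice never reaches the exception handler and is never reported as a bug.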

The `validation` extra and validate_payload#

validate_payload runs a payload against a bundled JSON Schema using the optional jsonschema dependency. The dependency is opt-in so the toolkit’s core deps stay at numpy / scipy / sklearn.

# Install: pip install "eval-toolkit[validation]"
from eval_toolkit.artifacts import validate_payload

result_payload = {
    "run_id": "demo",
    "schema_version": "v1",
    "config": {},
    "by_slice": {
        "dev": {"n": 100, "n_positive": 50, "by_scorer": {"model": {"pr_auc": 0.8}}}
    },
}
validate_payload(result_payload, schema_name="results.v1.json")  # raises on bad shape

The schemas live at eval_toolkit/schemas/*.json and follow the additive-fields, additionalProperties: true policy documented in versioning.md § schema-evolution: old consumers see new fields as inert; new consumers see old payloads as valid.

Pitfalls / Common mistakes#

Do include content_hash when running paired diffs. Without it, paired_diff_from_prediction_refs can only check row_id equality, so a row whose underlying text changed silently between runs still compares as the same row. The paired diff and its variance estimate are then meaningless. The fix: hash the input text and persist that hash with the prediction.

Don’t build artifact dicts inline. Construct PredictionColumns and PredictionArtifactRef instances and serialize via .to_dict(). Hand-written {"uri": ..., "media_type": ...} dicts skip the __post_init__ validation that catches typos ("row_ids" vs "row_id") and bad shapes (n_rows as a bool).
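To see why construction-time validation matters, here is a minimal stand-in dataclass. ArtifactRefSketch is hypothetical and much smaller than PredictionArtifactRef; the exception types the real classes raise are not documented here, so treat the ones below as assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ArtifactRefSketch:
    """Hypothetical stand-in showing validation in __post_init__."""

    uri: str
    media_type: str
    n_rows: Optional[int] = None

    def __post_init__(self) -> None:
        if not self.uri:
            raise ValueError("uri must be non-empty")
        if self.n_rows is not None:
            # bool is a subclass of int, so it must be rejected explicitly.
            if isinstance(self.n_rows, bool) or not isinstance(self.n_rows, int):
                raise TypeError("n_rows must be an int")
            if self.n_rows < 0:
                raise ValueError("n_rows must be non-negative")


good = ArtifactRefSketch(uri="preds.csv", media_type="text/csv", n_rows=10)

try:
    ArtifactRefSketch(uri="preds.csv", media_type="text/csv", n_rows=True)
    bool_rejected = False
except TypeError:
    bool_rejected = True

try:
    # A misspelled field name fails at construction instead of passing silently.
    ArtifactRefSketch(uri="preds.csv", media_type="text/csv", n_row=10)
    typo_rejected = False
except TypeError:
    typo_rejected = True
```

A hand-written dict with the same typo would be accepted without complaint and fail only when some later consumer looks for the missing key.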

Don’t treat media_type as decorative. load_prediction_arrays uses it to select the prediction reader. The toolkit recognizes two media types out of the box, text/csv and application/jsonl; any other type raises a clear “no built-in reader” ValueError. Setting media_type correctly is the difference between a working pipeline and a debugging session.
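The dispatch pattern can be sketched with a small registry keyed by media type. This is an illustrative stand-in, not the toolkit's reader code: READERS, read_rows, and the exact error message are assumptions.

```python
import csv
import io
import json


def _read_csv(text: str) -> list:
    return list(csv.DictReader(io.StringIO(text)))


def _read_jsonl(text: str) -> list:
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Hypothetical registry: one reader per recognized media type.
READERS = {"text/csv": _read_csv, "application/jsonl": _read_jsonl}


def read_rows(text: str, media_type: str) -> list:
    try:
        reader = READERS[media_type]
    except KeyError:
        raise ValueError(f"no built-in reader for media type {media_type!r}") from None
    return reader(text)


csv_rows = read_rows("label,score\n1,0.9\n", "text/csv")
jsonl_rows = read_rows('{"label": 1, "score": 0.9}\n', "application/jsonl")

try:
    read_rows("", "application/x-parquet")
    unknown_error = None
except ValueError as exc:
    unknown_error = str(exc)
```

Note that the CSV reader yields strings while the JSONL reader yields typed values, which is one reason a wrong media_type tends to fail in confusing ways further downstream.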

Don’t silently swallow non-finite metrics. sanitize_for_json converts NaN / +Inf / -Inf to a skipped_metric(...) payload so json.dumps(allow_nan=False) succeeds. The downstream cost: a renderer that doesn’t look at status thinks the metric is missing. Always pattern-match on status first.
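The underlying problem is easy to reproduce with the standard library: json.dumps with allow_nan=False rejects NaN and infinities. The sanitize function below is a simplified stand-in written for this example, not the toolkit's sanitize_for_json, and the payload keys it emits are assumptions.

```python
import json
import math


def sanitize(value):
    """Hypothetical stand-in: replace non-finite floats with a skipped-style dict."""
    if isinstance(value, float) and not math.isfinite(value):
        return {"status": "skipped", "reason": "non-finite value"}
    if isinstance(value, dict):
        return {k: sanitize(v) for k, v in value.items()}
    if isinstance(value, list):
        return [sanitize(v) for v in value]
    return value


metrics = {"pr_auc": 0.8, "fpr_at_recall": float("nan")}

# Strict JSON encoding refuses NaN outright.
try:
    json.dumps(metrics, allow_nan=False)
    strict_dump_failed = False
except ValueError:
    strict_dump_failed = True

# After sanitizing, the same payload encodes cleanly.
encoded = json.dumps(sanitize(metrics), allow_nan=False)
decoded = json.loads(encoded)
```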

Don’t rely on the validation extra in production code paths. validate_payload raises ImportError if jsonschema is not installed. Catch that explicitly in any consumer code that wants to degrade gracefully when running in environments where the extra isn’t present (e.g., a lightweight inference-only image).