Prediction Artifacts and Metric States
This chapter covers eval_toolkit.artifacts — the layer that lets you
separate predictions from the metrics computed on them. The toolkit
exposes three contracts: PredictionArtifactRef (manifest pointer to
on-disk predictions), PredictionColumns (column-mapping schema), and
MetricState (typed status wrapper for metrics that may be skipped or
errored). Together they answer a single question: can a downstream
consumer replay this evaluation without re-running inference?
Background. This chapter assumes you’ve produced a RunResult (see getting-started) and have a CSV, JSONL, or Parquet file of per-row predictions. The methodology curriculum covers the metric kernels (API: metrics), bootstrap CI shapes (bootstrap.md), and the broader claim model (claims.md). What’s not covered elsewhere is the data contract that ties them together: how the bytes of a predictions.csv connect to a typed reference in manifest.json and the optional schema-validated payload that ships next to it.
Why a separate artifact layer?
Computing metrics from predictions in-memory is cheap. The problem appears later, when a stakeholder asks:
“What were the actual scores for the rows that landed in the FPR-low region of the curve?”
“Did the v0.8 release use the same eval predictions as the v0.9 release, or did the scorer drift?”
“Can we re-compute PR-AUC under a new bootstrap method without re-running inference on 50K rows?”
The fix is to persist predictions as a first-class artifact with a
stable on-disk contract. That artifact then carries enough metadata
(uri, media_type, columns, sha256, n_rows) for any future
analysis pipeline to load, validate, and recompute against it.
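As a rough illustration of where the integrity metadata could come from, the sketch below computes a sha256 digest and a row count for a CSV file using only the standard library. The helper name fingerprint_csv is hypothetical and not part of the toolkit, and counting data rows without the header is an assumption about what n_rows means.

```python
import csv
import hashlib
import tempfile
from pathlib import Path


def fingerprint_csv(path: Path) -> dict:
    """Return the sha256 hex digest and data-row count for a CSV file.

    Hypothetical helper: the toolkit may compute these values differently.
    """
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Hash the raw bytes in chunks so large files need not fit in memory.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    with path.open(newline="") as fh:
        # DictReader consumes the header, so this counts data rows only
        # (an assumption about the meaning of n_rows).
        n_rows = sum(1 for _ in csv.DictReader(fh))
    return {"sha256": digest.hexdigest(), "n_rows": n_rows}


demo_path = Path(tempfile.mkdtemp()) / "predictions.csv"
demo_path.write_text("label,score\n1,0.9\n0,0.2\n", encoding="utf-8")
fingerprint = fingerprint_csv(demo_path)
```

The resulting values could then be passed as the sha256 and n_rows arguments when the reference is constructed.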
The data model
Three frozen dataclasses make up the public contract:
PredictionColumns
The column-mapping schema. Required: label and score. Optional but
strongly recommended for paired-diff workflows: row_id and
content_hash.
from eval_toolkit.artifacts import PredictionColumns
columns = PredictionColumns(
    label="y_true",
    score="prob_positive",
    row_id="example_id",
    content_hash="text_sha256",
)
assert columns.label == "y_true"
row_id lets downstream code align predictions across two runs of
the same scorer for paired bootstrap diffs. content_hash detects
silent label-drift or text-mutation: same row_id + different
content_hash means the row is not the same row.
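The drift rule above can be sketched in plain Python. This is an illustration only: find_drifted_rows is a hypothetical helper, and the rows are assumed to be dicts keyed by "row_id" and "content_hash".

```python
def find_drifted_rows(run_a: list[dict], run_b: list[dict]) -> list[str]:
    """Return row_ids present in both runs whose content_hash differs."""
    hashes_a = {row["row_id"]: row["content_hash"] for row in run_a}
    drifted = []
    for row in run_b:
        row_id = row["row_id"]
        # Same row_id but a different content_hash: not the same row.
        if row_id in hashes_a and hashes_a[row_id] != row["content_hash"]:
            drifted.append(row_id)
    return drifted


run_a = [
    {"row_id": "r0", "content_hash": "h0"},
    {"row_id": "r1", "content_hash": "h1"},
]
run_b = [
    {"row_id": "r0", "content_hash": "h0"},
    {"row_id": "r1", "content_hash": "h1-edited"},  # text changed between runs
]
drifted = find_drifted_rows(run_a, run_b)
```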
PredictionArtifactRef
The manifest-level pointer to a persisted prediction file. Carries the
URI, MIME-style media type, the column mapping, plus integrity
metadata (sha256, n_rows).
from eval_toolkit.artifacts import PredictionArtifactRef, PredictionColumns
ref = PredictionArtifactRef(
    uri="s3://my-eval-bucket/run-42/predictions.csv",
    media_type="text/csv",
    columns=PredictionColumns(label="y", score="s"),
    sha256="a" * 64,
    n_rows=10_000,
    role="locked_eval_predictions",
)
serialized = ref.to_dict()
assert serialized["uri"].startswith("s3://")
The role field is free-form — useful when a manifest carries
predictions for multiple slices or scorers and the renderer needs to
distinguish them.
MetricState
A typed wrapper for metrics that may not be computable: too few rows,
a single-class slice, NaN inputs, or an exception in a downstream tool.
The status field is "ok", "skipped", or "error".
from eval_toolkit.artifacts import MetricState, error_metric, skipped_metric
# Successful metric:
ok = MetricState(value=0.82, status="ok").to_dict()
assert ok["value"] == 0.82
# A slice that's too small to bootstrap:
skipped = skipped_metric("n_resamples below floor", n=5)
assert skipped["status"] == "skipped"
# A metric whose computation hit a divide-by-zero:
errored = error_metric("ZeroDivisionError", numerator=0, denominator=0)
assert errored["status"] == "error"
Consumers reading a metric should pattern-match on status before
trusting value. The point is to make the absence of a number an
explicit, JSON-typed state instead of a null that could mean a
dozen different things.
Worked walkthrough
This walkthrough produces predictions, registers them as an artifact, computes a bootstrap CI from the artifact, and computes a paired diff against a second artifact, all without re-running inference.
import csv
import numpy as np
import tempfile
from pathlib import Path
from eval_toolkit.analysis import (
    bootstrap_metric_from_predictions,
    load_prediction_arrays,
    paired_diff_from_prediction_refs,
)
from eval_toolkit.artifacts import PredictionArtifactRef, PredictionColumns
tmpdir = Path(tempfile.mkdtemp())
# Synthetic predictions for two scorers on the same dev rows.
rng = np.random.default_rng(0)
n = 200
labels = rng.integers(0, 2, size=n)
baseline_scores = np.clip(0.3 + 0.4 * labels + rng.normal(0, 0.15, n), 0, 1)
candidate_scores = np.clip(0.2 + 0.6 * labels + rng.normal(0, 0.10, n), 0, 1)
def _write_csv(path: Path, scores: np.ndarray) -> None:
    with path.open("w", newline="") as fh:
        writer = csv.DictWriter(
            fh, fieldnames=["label", "score", "row_id", "content_hash"]
        )
        writer.writeheader()
        for i, (y, s) in enumerate(zip(labels, scores, strict=True)):
            writer.writerow(
                {"label": int(y), "score": float(s), "row_id": f"r{i}", "content_hash": f"h{i}"}
            )
baseline_path = tmpdir / "baseline_preds.csv"
candidate_path = tmpdir / "candidate_preds.csv"
_write_csv(baseline_path, baseline_scores)
_write_csv(candidate_path, candidate_scores)
columns = PredictionColumns(
    label="label", score="score", row_id="row_id", content_hash="content_hash"
)
baseline_ref = PredictionArtifactRef(
    uri=str(baseline_path),
    media_type="text/csv",
    columns=columns,
)
candidate_ref = PredictionArtifactRef(
    uri=str(candidate_path),
    media_type="text/csv",
    columns=columns,
)
# Re-load the predictions later (could be a separate process):
arrays = load_prediction_arrays(baseline_ref.to_dict())
assert arrays.labels.shape == (n,)
assert arrays.scores.shape == (n,)
# Bootstrap PR-AUC CI directly from the on-disk artifact:
ci = bootstrap_metric_from_predictions(baseline_ref.to_dict(), n_resamples=50, seed=1)
assert ci["n_resamples"] == 50
low, high = ci["ci_95"]
assert low <= high
# Paired diff: candidate − baseline (rows align by row_id + content_hash):
diff = paired_diff_from_prediction_refs(
    baseline_ref.to_dict(),
    candidate_ref.to_dict(),
    n_resamples=50,
    seed=1,
)
assert diff["n_resamples"] == 50
The diff helper enforces matching row_id and content_hash between
the two artifacts. Mismatches raise ValueError rather than silently
producing a meaningless diff.
MetricState vs error_metric helpers
Three flavors of metric reporting, in order of escalation:
MetricState(value=x, status="ok"): the metric was computed. Use this directly when you control the metric pipeline.
skipped_metric(reason, **details): the metric was not computed for a stated reason. Examples: single-class slice (PR-AUC undefined), too few rows for bootstrap, missing positives at a target FPR. The metric was expected to be unavailable.
error_metric(reason, **details): the metric raised an unexpected exception. The metric was expected to be computable but wasn’t. The renderer should highlight these as bugs to investigate.
skipped is the normal “principled absence.” error is the
“something is wrong here” state. Don’t conflate them — a renderer that
treats all non-ok states identically loses important signal.
The validation extra and validate_payload
validate_payload runs a payload against a bundled JSON Schema using
the optional jsonschema dependency. The dependency is opt-in so the
toolkit’s core deps stay at numpy / scipy / sklearn.
# Install: pip install "eval-toolkit[validation]"
from eval_toolkit.artifacts import validate_payload
result_payload = {
    "run_id": "demo",
    "schema_version": "v1",
    "config": {},
    "by_slice": {
        "dev": {"n": 100, "n_positive": 50, "by_scorer": {"model": {"pr_auc": 0.8}}}
    },
}
validate_payload(result_payload, schema_name="results.v1.json") # raises on bad shape
The schemas live at eval_toolkit/schemas/*.json and follow the
additive-fields, additionalProperties: true policy documented in
versioning.md § schema-evolution: old
consumers see new fields as inert; new consumers see old payloads as
valid.
Pitfalls / Common mistakes
Do include content_hash when running paired diffs. Without it,
paired_diff_from_prediction_refs can only check row_id equality —
which means a row whose underlying text changed silently between runs
will compare as the same row. The diff’s variance estimate is then
meaningless. The fix: hash the input text and persist that hash with
the prediction.
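One way to produce such a hash is sketched below, assuming the input is a text string and that hashing its UTF-8 bytes with SHA-256 is acceptable. The helper name content_hash_for and the record layout are illustrative, and any text normalization policy is left to the caller.

```python
import hashlib


def content_hash_for(text: str) -> str:
    """Return a hex digest that changes whenever the input text changes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


inputs = [
    {"row_id": "r0", "text": "the cat sat"},
    {"row_id": "r1", "text": "the dog ran"},
]
# Persist the hash next to each prediction so later runs can be compared.
records = [
    {"row_id": row["row_id"], "content_hash": content_hash_for(row["text"])}
    for row in inputs
]
```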
Don’t build artifact dicts inline. Construct PredictionColumns
and PredictionArtifactRef instances and serialize via .to_dict().
Hand-written {"uri": ..., "media_type": ...} dicts skip the
__post_init__ validation that catches typos ("row_ids" vs
"row_id") and bad shapes (n_rows as a bool).
Don’t treat media_type as decorative. It routes the prediction
reader selection in load_prediction_arrays. The two media types the
toolkit recognizes out of the box are text/csv and
application/jsonl; other types raise a clear “no built-in reader”
ValueError. Setting media_type correctly is the difference between
a working pipeline and a debugging session.
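The routing idea can be pictured as a lookup table from media type to reader. The sketch below is a simplified stand-in that parses in-memory text, not the toolkit's reader code; READERS and read_predictions are hypothetical names.

```python
import csv
import io
import json


def _read_csv(text: str) -> list[dict]:
    # CSV values come back as strings; conversion is left to the caller.
    return list(csv.DictReader(io.StringIO(text)))


def _read_jsonl(text: str) -> list[dict]:
    return [json.loads(line) for line in text.splitlines() if line.strip()]


READERS = {"text/csv": _read_csv, "application/jsonl": _read_jsonl}


def read_predictions(text: str, media_type: str) -> list[dict]:
    """Pick a reader by media type; unknown types fail loudly."""
    try:
        reader = READERS[media_type]
    except KeyError:
        raise ValueError(f"no built-in reader for media type {media_type!r}") from None
    return reader(text)


csv_rows = read_predictions("label,score\n1,0.9\n", "text/csv")
jsonl_rows = read_predictions('{"label": 1, "score": 0.9}\n', "application/jsonl")
```

With this shape, a wrong media_type either selects the wrong parser or raises immediately, which matches the failure modes described above.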
Don’t silently swallow non-finite metrics. sanitize_for_json
converts NaN / +Inf / -Inf to a skipped_metric(...) payload so
json.dumps(allow_nan=False) succeeds. The downstream cost: a renderer
that doesn’t look at status thinks the metric is missing. Always
pattern-match on status first.
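The behavior described here can be approximated with a short recursive function. This is a sketch of the idea under assumed payload shapes, not the source of sanitize_for_json; the function name sanitize and the "reason" text are made up for the example.

```python
import json
import math


def sanitize(obj):
    """Replace non-finite floats with a skipped-style payload, recursively."""
    if isinstance(obj, float) and not math.isfinite(obj):
        return {"status": "skipped", "reason": f"non-finite value: {obj}"}
    if isinstance(obj, dict):
        return {key: sanitize(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [sanitize(value) for value in obj]
    return obj


raw = {"pr_auc": float("nan"), "n": 10, "ci_95": [0.1, float("inf")]}
clean = sanitize(raw)
# Strict encoding succeeds because no NaN or Inf remains in the payload.
encoded = json.dumps(clean, allow_nan=False)
```

After sanitizing, clean["pr_auc"] is a dict with a status rather than a number, so a consumer that reads it as a float without checking status will misinterpret it.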
Don’t rely on the validation extra in production code paths.
validate_payload raises ImportError if jsonschema is not
installed. Catch that explicitly in any consumer code that wants to
degrade gracefully when running in environments where the extra isn’t
present (e.g., a lightweight inference-only image).
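A graceful-degradation wrapper might look like the following sketch. The validate argument is any callable that takes a payload, for example a small function that forwards to validate_payload with a schema name; validate_or_skip and the two stand-in validators are hypothetical and exist only to exercise both paths.

```python
def validate_or_skip(validate, payload: dict) -> str:
    """Run validation when possible; report a skip when the extra is missing.

    Only ImportError is caught, so genuine validation failures still propagate.
    """
    try:
        validate(payload)
    except ImportError:
        return "skipped"
    return "validated"


def _validator_without_extra(payload: dict) -> None:
    # Stand-in for a validator whose optional dependency is not installed.
    raise ImportError("jsonschema is not installed")


def _validator_with_extra(payload: dict) -> None:
    # Stand-in for a validator that accepts the payload.
    return None


missing_outcome = validate_or_skip(_validator_without_extra, {"run_id": "demo"})
present_outcome = validate_or_skip(_validator_with_extra, {"run_id": "demo"})
```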
See also
claims.md: the gate layer that reads from these artifacts to produce go/no-go verdicts.
evidence.md: broader source-role and claim-mode framing.
versioning.md § schema-evolution: the additive-fields contract for the bundled .v1.json schemas.
eval_toolkit.analysis module docstring: implementation details for the CSV / JSONL prediction readers and the bootstrap_metric_from_predictions / paired_diff_from_prediction_refs helpers.