Schema Reference

Field-by-field reference for the JSON Schemas bundled with eval-toolkit. The schemas live at src/eval_toolkit/schemas/ and ship inside the installed wheel, so they can be discovered programmatically and validated with either the optional [validation] extra or the eval-toolkit CLI.

Scope. This doc describes what’s in each schema today. For the evolution policy (when fields are added, when .vN bumps, how consumers should treat additionalProperties: true), read methodology/versioning.md § schema-evolution. For the data model underlying prediction_artifacts and claim_report, read methodology/artifacts.md and methodology/claims.md.

Inventory

Three schemas ship today. All use JSON Schema Draft 2020-12 and set "additionalProperties": true at the top level, so older consumers can ignore optional fields added later.

| Schema | Top-level shape | When produced | Consumers |
| --- | --- | --- | --- |
| results.v1.json | RunResult with per-row scores stripped | eval_toolkit.harness.write_run_result(...) writes run_dir/results.json | Report renderers, lightweight downstream parsers |
| results_full.v1.json | Same as results.v1.json, but with by_slice[*].by_scorer[*].scores arrays retained | Same harness call, second file (results_full.json) | Replay pipelines, post-hoc bootstrap on the original scores |
| manifest.v1.json | RunManifest: env / seeds / hashes / source roles | eval_toolkit.manifest.write_manifest(...) writes run_dir/manifest.json | Reproducibility audits, NeurIPS-aligned report templates |

results.v1.json

Required: schema_version, run_id, config, by_slice.

| Field | Type | Required | Since | Semantics |
| --- | --- | --- | --- | --- |
| schema_version | "v1" (const) | yes | v0.5 | Stable version label. Bumps on breaking output changes. |
| run_id | string (min length 1) | yes | v0.5 | Caller-supplied unique identifier for the run. |
| git_sha | string or null | optional | v0.5 | Captured via capture_git_sha(). null if not in a git repo or git is unavailable. |
| config | object | yes | v0.5 | Eval-time config: n_resamples, seed, scorers, slices, paired_diffs, on_scorer_error, optional leakage_report. Free-form within the toolkit’s contract. |
| by_slice | object of slice-blocks | yes | v0.5 | Per-slice nested results. See "Slice-block shape". |
| by_fold | object | optional | v0.7 | Per-fold raw RunResult payloads when evaluate_folded(...) is used. Empty for single-fold runs. |
| fold_summary | object | optional | v0.7 | CV-CI summary indexed by [slice][scorer][metric]{mean, ci_low, ci_high, n_folds}. Empty for non-folded runs. |
| claim_report | object | optional | v0.9 | ClaimReport.to_dict() payload from evaluate_claims(...). |
| prediction_artifacts | array of objects | optional | v0.9 | List of PredictionArtifactRef.to_dict() entries. |
| evidence_axes | array of objects | optional | v0.9 | Each entry is {name, value}; see EvidenceAxis. |
| pairing_metadata | object | optional | v0.9 | PairingMetadata.to_dict() payload. |
| aggregate_evidence | object | optional | v0.9 | {status: "inferential"/"descriptive"/"diagnostic"/"unsupported", ...}. |
| threshold_policy | object | optional | v0.9 | ThresholdPolicyMetadata.to_dict() payload: calibration slice, score column, selector, constraints, claim-enabled flag. |
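The nested fold_summary indexing ([slice][scorer][metric]) can be awkward to consume directly. The sketch below flattens it into one row per entry. It uses only the standard library; the helper name flatten_fold_summary and the sample numbers are hypothetical and are not part of eval-toolkit.

```python
# Hypothetical fold_summary fragment; the numbers are made up for illustration.
fold_summary = {
    "dev": {
        "model": {
            "pr_auc": {"mean": 0.81, "ci_low": 0.78, "ci_high": 0.84, "n_folds": 5},
        }
    }
}


def flatten_fold_summary(summary):
    """Return one flat dict per (slice, scorer, metric) entry."""
    rows = []
    for slice_name, by_scorer in summary.items():
        for scorer_name, by_metric in by_scorer.items():
            for metric_name, stats in by_metric.items():
                rows.append(
                    {
                        "slice": slice_name,
                        "scorer": scorer_name,
                        "metric": metric_name,
                        **stats,  # mean, ci_low, ci_high, n_folds
                    }
                )
    return rows


rows = flatten_fold_summary(fold_summary)
```

An empty fold_summary (the non-folded case) simply yields an empty list, so callers do not need a separate branch for single-fold runs.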

Slice-block shape

Each by_slice[<slice_name>] is an object with the properties listed below. The JSON Schema does not enforce this inner shape; it is documented here rather than typed in the schema because the metric set varies by toolkit minor version.

  • n (int) — total rows in the slice

  • n_positive (int) — positive labels in the slice

  • by_scorer (object) — {scorer_name: per-scorer-block}

  • paired_diffs (object) — {(baseline, candidate): diff-payload}, populated only when paired_diffs=[(a, b)] was passed to evaluate

Each per-scorer block carries:

  • pr_auc, roc_auc, brier_score (floats)

  • pr_auc_ci, roc_auc_ci, brier_score_ci (BootstrapCI dicts: {point_estimate, ci_95: [low, high], confidence, n_resamples, method})

  • ece, ece_equal_mass, ece_equal_width, ece_equal_mass_error (calibration errors)

  • precision_at_prior (float)

  • operating_points (object) — selector-keyed threshold metadata

  • transferred_operating_points (object, optional) — populated when OperatingPointSpec was used

  • scores — stripped from results.v1, retained in results_full.v1

  • is_single_class (bool) — true when the slice has only one label class (PR-AUC undefined; metrics may be skipped)
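As an illustration of the slice-block layout, the following sketch walks by_slice and by_scorer to collect one headline metric per (slice, scorer) pair, skipping single-class slices where the metric may be absent. The helper name headline_metric and the sample payload are hypothetical; only the key names come from the description above.

```python
# Hypothetical results payload fragment; values are illustrative only.
results = {
    "by_slice": {
        "dev": {
            "n": 100,
            "n_positive": 50,
            "by_scorer": {
                "model": {"pr_auc": 0.82, "roc_auc": 0.90, "is_single_class": False},
            },
        },
        "all_negative": {
            "n": 40,
            "n_positive": 0,
            "by_scorer": {"model": {"is_single_class": True}},
        },
    }
}


def headline_metric(payload, metric="pr_auc"):
    """Map (slice, scorer) -> metric value, skipping single-class blocks."""
    out = {}
    for slice_name, block in payload.get("by_slice", {}).items():
        for scorer_name, scorer_block in block.get("by_scorer", {}).items():
            if scorer_block.get("is_single_class"):
                continue  # metric may be undefined or skipped for this slice
            if metric in scorer_block:
                out[(slice_name, scorer_name)] = scorer_block[metric]
    return out


pr_aucs = headline_metric(results)
```

Because the inner shape is not schema-enforced, the sketch reads every key with .get or a membership test instead of assuming it is present.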

results_full.v1.json

Same top-level shape as results.v1.json. The only difference is that by_slice[*].by_scorer[*].scores arrays are retained, enabling re-computation of metrics or bootstrap CIs from the raw scores without re-running inference.

Use this variant when you want full replay capability. Use the compact variant when you only need the headline metrics.
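The relationship between the two variants can be sketched as a pure transformation: dropping every by_slice[*].by_scorer[*].scores array from a full payload gives the compact shape. This is an illustration of the documented difference, not the harness's actual implementation; strip_scores is a hypothetical helper.

```python
import copy


def strip_scores(full_payload):
    """Return a copy of a results_full-style payload without per-row scores."""
    compact = copy.deepcopy(full_payload)  # leave the caller's payload untouched
    for block in compact.get("by_slice", {}).values():
        for scorer_block in block.get("by_scorer", {}).values():
            scorer_block.pop("scores", None)
    return compact


# Hypothetical full payload; the scores are made up for illustration.
full = {
    "schema_version": "v1",
    "run_id": "demo",
    "config": {},
    "by_slice": {
        "dev": {"by_scorer": {"model": {"pr_auc": 0.82, "scores": [0.1, 0.9, 0.4]}}}
    },
}
compact = strip_scores(full)
```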

manifest.v1.json

Required: schema_version, run_id, code_versions, env.

| Field | Type | Required | Since | Semantics |
| --- | --- | --- | --- | --- |
| schema_version | "v1" (const) | yes | v0.5 | Stable version. |
| run_id | string | yes | v0.5 | Matches results.run_id. |
| git_sha | string or null | optional | v0.5 | Captured at run time. |
| dirty_flag | bool | optional | v0.5 | True iff git status --porcelain had output. NeurIPS clean-replay concern. |
| code_versions | {package: version} | yes | v0.5 | At minimum {eval_toolkit: "x.y.z"}. Use Versioned-implementing objects to enrich. |
| seeds | {source: seed} | optional | v0.5 | Global / bootstrap / torch / dataloader RNG seeds. |
| data_hashes | {name: "sha256:..."} | optional | v0.5 | SHA-256 of each input artifact (CSV, JSONL, Parquet, …). |
| config_hash | string | optional | v0.5 | SHA-256 over canonical-JSON encoding of the eval config. |
| env | object | yes | v0.5 | Environment fingerprint: python, platform, key dep versions. |
| gpu_info | object | optional | v0.5 | {name, count, memory_gb} from nvidia-smi. Empty when unavailable. |
| cuda_version | string or null | optional | v0.5 | Reported by nvidia-smi. |
| wall_clock_seconds | number or null | optional | v0.5 | Optional run duration. |
| versioned_objects | {name: version} | optional | v0.7 | Auto-collected from Tier-2 implementations exposing version (see methodology/versioning.md). |
| leakage_report | object or null | optional | v0.7 | LeakageReport.to_dict() payload from run_leakage_checks(...). |
| source_roles | array of SourceRoleRecord dicts | optional | v0.8 | Each entry: {source, role, n_rows?, notes?, metadata?}. |
| guardrails | array of strings | optional | v0.8 | Predeclared free-form guardrails (e.g., ["no threshold tuning on locked_holdout"]). |
| prediction_artifacts | array of PredictionArtifactRef dicts | optional | v0.9 | {uri, media_type, columns, sha256?, n_rows?, role?, metadata?}; see methodology/artifacts.md. |
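To make the data_hashes and config_hash fields concrete, here is a minimal standard-library sketch of how such values could be computed. The exact canonical-JSON form the toolkit uses is not specified here, so sorted keys with compact separators is an assumption, as is leaving config_hash without a "sha256:" prefix. Both function names are hypothetical.

```python
import hashlib
import json


def sha256_of_bytes(data: bytes) -> str:
    """Hash raw artifact bytes in the 'sha256:<hex>' style shown for data_hashes."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def config_hash(config: dict) -> str:
    """Hash a config via a canonical JSON encoding.

    Assumption: 'canonical' means sorted keys and compact separators;
    the toolkit's own encoding may differ.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Key order does not affect the hash because keys are sorted before encoding.
h1 = config_hash({"seed": 0, "n_resamples": 100})
h2 = config_hash({"n_resamples": 100, "seed": 0})
```

Sorting keys is what makes the hash stable across dict orderings; without it, two semantically identical configs could produce different digests.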

leakage_report shape

When present, conforms to:

leakage_report.findings: [
  {
    "check_name": "exact_duplicate" | "near_duplicate" | ...,
    "severity": "error" | "warning" | "info",
    "drop_indices": object,
    "evidence": object,
    "message": string,
    "n_affected": int (≥ 0)
  },
  ...
]
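A consumer that gates on leakage findings might group them by severity first. The sketch below assumes only the findings shape shown above and handles the null case that the manifest table allows; findings_by_severity and has_blocking_findings are hypothetical helpers, and treating "error" as blocking is an assumption.

```python
def findings_by_severity(report):
    """Group leakage findings by their 'severity' value; tolerate a null report."""
    if not report:
        return {}
    grouped = {}
    for finding in report.get("findings", []):
        grouped.setdefault(finding["severity"], []).append(finding)
    return grouped


def has_blocking_findings(report):
    """Assumption: only 'error'-severity findings should block a run."""
    return bool(findings_by_severity(report).get("error"))


# Hypothetical report; check names follow the enum sketched above.
leakage_report = {
    "findings": [
        {"check_name": "exact_duplicate", "severity": "error",
         "drop_indices": {}, "evidence": {}, "message": "3 duplicated rows", "n_affected": 3},
        {"check_name": "near_duplicate", "severity": "warning",
         "drop_indices": {}, "evidence": {}, "message": "1 near match", "n_affected": 1},
    ]
}
grouped = findings_by_severity(leakage_report)
```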

source_roles shape

source_roles: [
  {
    "source": "main_train",
    "role": "train",
    "n_rows": 50000,
    "notes": "v0.8 train pool",
    "metadata": {"data_card_url": "..."}
  },
  ...
]

Recommended roles: train, calibration, locked_eval, external_diagnostic, excluded. Custom roles are allowed (additionalProperties: true on each entry).
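Since custom roles are permitted, an audit script may want to surface them instead of rejecting them. A minimal sketch, assuming only the entry shape shown above; custom_roles is a hypothetical helper and the "shadow_eval" role is invented for the example.

```python
# Recommended role names, copied from the list above.
RECOMMENDED_ROLES = frozenset(
    {"train", "calibration", "locked_eval", "external_diagnostic", "excluded"}
)


def custom_roles(source_roles):
    """Return the sorted role names that are not in the recommended set."""
    return sorted({entry["role"] for entry in source_roles} - RECOMMENDED_ROLES)


# Hypothetical manifest fragment.
source_roles = [
    {"source": "main_train", "role": "train", "n_rows": 50000},
    {"source": "partner_feed", "role": "shadow_eval"},
]
unexpected = custom_roles(source_roles)
```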

Versioning + forward compatibility

All three schemas are at .v1. The filename includes the version suffix (results.v1.json), and the $id URL encodes it too. A future incompatible change ships at .v2 (results.v2.json) and leaves .v1 in the wheel so both versions remain validatable side by side.

Within a major schema version, the rule is additive only:

  • New optional fields can appear at any time.

  • "additionalProperties": true is set at the top level (and on nested object schemas where reasonable), so old consumers see new fields as inert.

  • Removing a field, renaming a field, narrowing an enum, tightening a required array, or changing a type are all breaking changes that trigger a .vN+1 bump.

See methodology/versioning.md § schema-evolution for the full policy and concrete v0.7→v0.9 case studies.
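On the consumer side, the additive-only rule suggests a tolerant-reader pattern: check schema_version, pick out the fields you know, and ignore everything else. The following is one possible sketch of that pattern; read_results and KNOWN_FIELDS are hypothetical names, and the "future_field" key is invented to stand in for a field added in a later minor release.

```python
# Fields this (hypothetical) consumer understands; anything else is ignored.
KNOWN_FIELDS = ("run_id", "git_sha", "config", "by_slice")


def read_results(payload):
    """Extract known fields from a v1 results payload, ignoring unknown ones."""
    version = payload.get("schema_version")
    if version != "v1":
        # A .v2 payload may have removed or renamed fields, so fail loudly.
        raise ValueError(f"unsupported schema_version: {version!r}")
    # Optional fields that are absent come back as None.
    return {name: payload.get(name) for name in KNOWN_FIELDS}


payload = {
    "schema_version": "v1",
    "run_id": "demo",
    "config": {},
    "by_slice": {},
    "future_field": {"added": "later"},  # unknown to this consumer
}
view = read_results(payload)
```

Rejecting unknown major versions while tolerating unknown fields mirrors the split between breaking and additive changes described above.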

Programmatic discovery

from pathlib import Path
import eval_toolkit


schemas_dir = Path(eval_toolkit.__file__).parent / "schemas"
names = sorted(p.stem for p in schemas_dir.glob("*.json"))
assert "results.v1" in names
assert "manifest.v1" in names

Or from the CLI (v0.10.0+):

eval-toolkit schemas list
# → manifest.v1
# → results.v1
# → results_full.v1

eval-toolkit schemas show results.v1
# Pretty-prints the schema as JSON

Validation example

# Requires: pip install "eval-toolkit[validation]"
from eval_toolkit.artifacts import validate_payload

payload = {
    "schema_version": "v1",
    "run_id": "demo",
    "config": {"n_resamples": 100},
    "by_slice": {
        "dev": {
            "n": 100,
            "n_positive": 50,
            "by_scorer": {"model": {"pr_auc": 0.82}},
        }
    },
}

# No-op on success; raises jsonschema.ValidationError on a bad shape.
validate_payload(payload, schema_name="results.v1.json")

Or from the CLI:

eval-toolkit validate run_dir/results.json results.v1
# → run_dir/results.json: OK against results.v1

Exit codes:

  • 0 — valid

  • 1 — schema validation failed

  • 2 — file or schema not found

  • 3 — [validation] extra not installed
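In a CI script, the exit codes above can be turned into readable messages. The sketch below wraps the validate command shown earlier with subprocess; describe_exit_code and validate_file are hypothetical helpers, and the wrapper assumes the eval-toolkit executable is on PATH (subprocess.run raises FileNotFoundError otherwise).

```python
import subprocess

# Exit-code meanings, copied from the list above.
EXIT_MEANINGS = {
    0: "valid",
    1: "schema validation failed",
    2: "file or schema not found",
    3: "[validation] extra not installed",
}


def describe_exit_code(code):
    """Translate a validate exit code into a short message."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def validate_file(path, schema="results.v1"):
    """Run `eval-toolkit validate <path> <schema>` and describe the outcome."""
    proc = subprocess.run(
        ["eval-toolkit", "validate", str(path), schema],
        capture_output=True,
        text=True,
    )
    return proc.returncode, describe_exit_code(proc.returncode)
```

Calling validate_file("run_dir/results.json") returns the raw code alongside its description, so a pipeline can branch on the number and log the text.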

See also