Schema Reference#
Field-by-field reference for the JSON Schemas bundled with eval-toolkit.
The schemas live at src/eval_toolkit/schemas/ and ship inside the
installed wheel — they’re discoverable programmatically and
validatable via the optional [validation] extra or the
eval-toolkit CLI.
Scope. This doc describes what’s in each schema today. For the evolution policy (when fields are added, when
.vN bumps, how consumers should treat additionalProperties: true), read methodology/versioning.md § schema-evolution. For the data model underlying prediction_artifacts and claim_report, read methodology/artifacts.md and methodology/claims.md.
Inventory#
Three schemas. All Draft 2020-12. All
"additionalProperties": true at the top level — old consumers
gracefully ignore new optional fields.
| Schema | Top-level shape | When produced | Consumers |
|---|---|---|---|
| results.v1.json | | | Report renderers, lightweight downstream parsers |
| results_full.v1.json | Same as above but | Same harness call, second file ( | Replay pipelines, post-hoc bootstrap on the original scores |
| manifest.v1.json | | | Reproducibility audits, NeurIPS-aligned report templates |
results.v1.json#
Required: schema_version, run_id, config, by_slice.
| Field | Type | Required | Since | Semantics |
|---|---|---|---|---|
| schema_version | | yes | v0.5 | Stable version label. Bumps on breaking output changes. |
| run_id | string (min length 1) | yes | v0.5 | Caller-supplied unique identifier for the run. |
| | string \| null | optional | v0.5 | Captured via |
| config | object | yes | v0.5 | Eval-time config: |
| by_slice | object of slice-blocks | yes | v0.5 | Per-slice nested results. See below. |
| | object | optional | v0.7 | Per-fold raw |
| | object | optional | v0.7 | CV-CI summary indexed by |
| | object | optional | v0.9 | |
| | array of objects | optional | v0.9 | List of |
| | array of objects | optional | v0.9 | Each entry |
| | object | optional | v0.9 | |
| | object | optional | v0.9 | |
| | object | optional | v0.9 | |
Slice-block shape#
Each by_slice[<slice_name>] is an object with the following
properties (no formal schema enforcement on inner shape — it’s
documented here rather than typed in the JSON Schema because the
metric set varies by toolkit minor version):
- n (int): total rows in the slice
- n_positive (int): positive labels in the slice
- by_scorer (object): {scorer_name: per-scorer-block}
- paired_diffs (object): {(baseline, candidate): diff-payload}, populated only when paired_diffs=[(a, b)] was passed to evaluate
Each per-scorer block carries:
- pr_auc, roc_auc, brier_score (floats)
- pr_auc_ci, roc_auc_ci, brier_score_ci (BootstrapCI dicts: {point_estimate, ci_95: [low, high], confidence, n_resamples, method})
- ece, ece_equal_mass, ece_equal_width, ece_equal_mass_error (calibration errors)
- precision_at_prior (float)
- operating_points (object): selector-keyed threshold metadata
- transferred_operating_points (object, optional): populated when OperatingPointSpec was used
- scores: stripped from results.v1, retained in results_full.v1
- is_single_class (bool): true when the slice has only one label class (PR-AUC undefined; metrics may be skipped)
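Because the inner slice-block shape is documented rather than schema-enforced, a consumer may want to walk it defensively. The following is a minimal sketch, not toolkit code: the helper name iter_metric and the example payload values are hypothetical, and it relies only on the by_slice / by_scorer nesting described above.

```python
from __future__ import annotations

from typing import Any, Iterator


def iter_metric(results: dict[str, Any], metric: str) -> Iterator[tuple[str, str, float]]:
    """Yield (slice_name, scorer_name, value) for each scorer block that has `metric`."""
    for slice_name, block in results.get("by_slice", {}).items():
        for scorer_name, scorer_block in block.get("by_scorer", {}).items():
            # Metrics may be skipped (for example on single-class slices),
            # so a missing key is tolerated rather than treated as an error.
            if metric in scorer_block:
                yield slice_name, scorer_name, scorer_block[metric]


# Hypothetical payload; the numbers are invented for illustration.
example = {
    "by_slice": {
        "dev": {
            "n": 100,
            "n_positive": 50,
            "by_scorer": {"model": {"pr_auc": 0.82, "is_single_class": False}},
        },
        "edge": {
            "n": 10,
            "n_positive": 0,
            "by_scorer": {"model": {"is_single_class": True}},
        },
    }
}

rows = list(iter_metric(example, "pr_auc"))
```

Here rows contains a single entry for the dev slice; the edge slice is skipped because its scorer block carries no pr_auc.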
results_full.v1.json#
Same top-level shape as results.v1.json. The only difference is
that by_slice[*].by_scorer[*].scores arrays are retained, enabling
re-computation of metrics or bootstrap CIs from the raw scores
without re-running inference.
Use this variant when you want full replay capability. Use the compact variant when you only need the headline metrics.
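The relationship between the two variants can be illustrated by deriving a compact payload from a full one. This is a sketch under the assumption that the only difference is the scores arrays, as stated above; strip_scores is a hypothetical helper, not a toolkit function, and the toolkit's own writer may produce the compact file differently.

```python
from __future__ import annotations

import copy
from typing import Any


def strip_scores(full: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of a results_full-style payload with per-scorer `scores` removed."""
    compact = copy.deepcopy(full)  # leave the caller's payload untouched
    for block in compact.get("by_slice", {}).values():
        for scorer_block in block.get("by_scorer", {}).values():
            scorer_block.pop("scores", None)
    return compact


# Hypothetical full payload with invented values.
full_payload = {
    "schema_version": "v1",
    "run_id": "demo",
    "config": {},
    "by_slice": {
        "dev": {"by_scorer": {"model": {"pr_auc": 0.82, "scores": [0.1, 0.9, 0.4]}}}
    },
}
compact_payload = strip_scores(full_payload)
```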
manifest.v1.json#
Required: schema_version, run_id, code_versions, env.
| Field | Type | Required | Since | Semantics |
|---|---|---|---|---|
| schema_version | | yes | v0.5 | Stable version. |
| run_id | string | yes | v0.5 | Matches |
| | string \| null | optional | v0.5 | Captured at run time. |
| | bool | optional | v0.5 | True iff |
| code_versions | | yes | v0.5 | At minimum |
| | | optional | v0.5 | Global / bootstrap / torch / dataloader RNG seeds. |
| | | optional | v0.5 | SHA-256 of each input artifact (CSV, JSONL, Parquet, …). |
| | string | optional | v0.5 | SHA-256 over canonical-JSON encoding of the eval config. |
| env | object | yes | v0.5 | Environment fingerprint: |
| | object | optional | v0.5 | |
| | string \| null | optional | v0.5 | Reported by |
| | number \| null | optional | v0.5 | Optional run duration. |
| | | optional | v0.7 | Auto-collected from Tier-2 implementations exposing |
| | object \| null | optional | v0.7 | |
| | array of | optional | v0.8 | Each entry: |
| | array of strings | optional | v0.8 | Predeclared free-form guardrails (e.g., |
| | array of | optional | v0.9 | |
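Two of the manifest rows describe SHA-256 digests: one per input artifact and one over a canonical-JSON encoding of the eval config. The sketch below shows how such digests could be computed with the standard library. The canonicalization used here (sorted keys, compact separators, UTF-8) is an assumption; this page does not specify the exact encoding eval-toolkit uses, so digests from this sketch should not be expected to match the toolkit's. Both function names are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def config_sha256(config: dict) -> str:
    """Hash a config dict via an assumed canonical form: sorted keys, compact separators."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large input artifacts need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Sorting keys makes the config digest independent of dict insertion order, which is the property a canonical encoding is meant to provide.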
leakage_report shape#
When present, conforms to:
leakage_report.findings: [
  {
    "check_name": "exact_duplicate" | "near_duplicate" | ...,
    "severity": "error" | "warning" | "info",
    "drop_indices": object,
    "evidence": object,
    "message": string,
    "n_affected": int (≥ 0)
  },
  ...
]
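A consumer might aggregate findings by severity before deciding whether a run is usable. The following is a minimal sketch based only on the shape above; summarize_findings is a hypothetical helper, the example findings are invented, and treating a missing or null report as empty is an assumption.

```python
from __future__ import annotations

from collections import Counter
from typing import Any, Optional


def summarize_findings(leakage_report: Optional[dict[str, Any]]) -> dict[str, int]:
    """Sum n_affected per severity; a missing or null report yields an empty summary."""
    totals: Counter = Counter()
    if not leakage_report:
        return {}
    for finding in leakage_report.get("findings", []):
        totals[finding["severity"]] += finding.get("n_affected", 0)
    return dict(totals)


# Hypothetical report with invented values.
report = {
    "findings": [
        {"check_name": "exact_duplicate", "severity": "error", "n_affected": 3},
        {"check_name": "near_duplicate", "severity": "warning", "n_affected": 2},
        {"check_name": "exact_duplicate", "severity": "error", "n_affected": 1},
    ]
}
summary = summarize_findings(report)
```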
source_roles shape#
source_roles: [
  {
    "source": "main_train",
    "role": "train",
    "n_rows": 50000,
    "notes": "v0.8 train pool",
    "metadata": {"data_card_url": "..."}
  },
  ...
]
Recommended roles: train, calibration, locked_eval,
external_diagnostic, excluded. Custom roles are allowed
(additionalProperties: true on each entry).
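Since custom roles are permitted, a reviewer may want to surface them rather than reject them. The sketch below lists roles outside the recommended set and totals rows per role. The helper names are hypothetical, and defaulting a missing n_rows to 0 is an assumption, since this page does not say whether n_rows is required.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Any

RECOMMENDED_ROLES = {"train", "calibration", "locked_eval", "external_diagnostic", "excluded"}


def custom_roles(source_roles: list[dict[str, Any]]) -> list[str]:
    """Return the sorted roles that fall outside the recommended set."""
    return sorted({entry["role"] for entry in source_roles} - RECOMMENDED_ROLES)


def rows_by_role(source_roles: list[dict[str, Any]]) -> dict[str, int]:
    """Total n_rows per role; entries without n_rows count as 0 (assumption)."""
    totals: dict[str, int] = defaultdict(int)
    for entry in source_roles:
        totals[entry["role"]] += entry.get("n_rows", 0)
    return dict(totals)


# Hypothetical entries; "shadow_eval" stands in for a custom role.
roles = [
    {"source": "main_train", "role": "train", "n_rows": 50000},
    {"source": "extra_train", "role": "train", "n_rows": 1000},
    {"source": "probe", "role": "shadow_eval", "n_rows": 200},
]
```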
Versioning + forward compatibility#
All three schemas are at .v1. The filename includes the version
suffix (results.v1.json), and the $id URL encodes it too. A
future incompatible change ships at .v2 (results.v2.json) and
leaves .v1 in the wheel so both versions remain validatable
side by side.
Within a major schema version, the rule is additive only:
- New optional fields can appear at any time.
- "additionalProperties": true is set at the top level (and on nested object schemas where reasonable), so old consumers see new fields as inert.
- Removing a field, renaming a field, narrowing an enum, tightening a required array, or changing a type are all breaking changes that trigger a .vN+1 bump.
See methodology/versioning.md § schema-evolution for the full policy and concrete v0.7→v0.9 case studies.
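On the consumer side, the additive-only rule suggests reading the fields you know about, ignoring the rest, and refusing only on a major version you do not support. The following is one way that could look, as a sketch: ResultsHeader and read_header are hypothetical names, and the "v1" value for schema_version is taken from the validation example later on this page.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ResultsHeader:
    schema_version: str
    run_id: str


def read_header(payload: dict[str, Any], supported: str = "v1") -> ResultsHeader:
    """Read only known fields; unknown top-level keys are ignored by design."""
    version = payload["schema_version"]
    if version != supported:
        # A different major version signals a breaking change, so fail loudly.
        raise ValueError(f"unsupported schema_version {version!r}; expected {supported!r}")
    return ResultsHeader(schema_version=version, run_id=payload["run_id"])


# A payload carrying a field this consumer has never heard of still parses.
header = read_header(
    {"schema_version": "v1", "run_id": "demo", "config": {}, "by_slice": {}, "future_field": 1}
)
```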
Programmatic discovery#
from pathlib import Path
import eval_toolkit
schemas_dir = Path(eval_toolkit.__file__).parent / "schemas"
names = sorted(p.stem for p in schemas_dir.glob("*.json"))
assert "results.v1" in names
assert "manifest.v1" in names
Or from the CLI (v0.10.0+):
eval-toolkit schemas list
# → manifest.v1
# → results.v1
# → results_full.v1
eval-toolkit schemas show results.v1
# Pretty-prints the schema as JSON
Validation example#
# Requires: pip install "eval-toolkit[validation]"
from eval_toolkit.artifacts import validate_payload
payload = {
    "schema_version": "v1",
    "run_id": "demo",
    "config": {"n_resamples": 100},
    "by_slice": {
        "dev": {
            "n": 100,
            "n_positive": 50,
            "by_scorer": {"model": {"pr_auc": 0.82}},
        }
    },
}
# No-op on success; raises jsonschema.ValidationError on a bad shape.
validate_payload(payload, schema_name="results.v1.json")
Or from the CLI:
eval-toolkit validate run_dir/results.json results.v1
# → run_dir/results.json: OK against results.v1
Exit codes:
- 0: valid
- 1: schema validation failed
- 2: file or schema not found
- 3: [validation] extra not installed
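In a CI script it can help to translate the exit code into a readable message. The sketch below wraps the validate command shown above with subprocess; the wrapper names are hypothetical, the mapping is copied from the exit-code list, and the code assumes the eval-toolkit executable is on PATH (subprocess.run raises FileNotFoundError otherwise).

```python
from __future__ import annotations

import subprocess

# Meanings as documented in the exit-code list above.
EXIT_MEANINGS = {
    0: "valid",
    1: "schema validation failed",
    2: "file or schema not found",
    3: "[validation] extra not installed",
}


def describe_exit(code: int) -> str:
    """Map a documented exit code to its meaning; flag anything else as unexpected."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def validate_file(path: str, schema: str) -> tuple[int, str]:
    """Run `eval-toolkit validate <path> <schema>` and describe the outcome."""
    proc = subprocess.run(
        ["eval-toolkit", "validate", path, schema],
        capture_output=True,
        text=True,
    )
    return proc.returncode, describe_exit(proc.returncode)
```

For example, validate_file("run_dir/results.json", "results.v1") would return the pair (0, "valid") for a conforming file.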
See also#
- methodology/versioning.md § schema-evolution: the policy that governs when fields are added vs when .vN bumps.
- methodology/artifacts.md: the PredictionArtifactRef/PredictionColumns contract that the prediction_artifacts field references.
- methodology/claims.md: the ClaimReport contract that the claim_report field references.
- getting-started.md § Validate the JSON: first-touch validation workflow.