Manifest schema

Public-site note. This is a technical provenance reference for people inspecting run artifacts. It is not a required first read.

Every canonical evaluation run produces a manifest.json capturing provenance. The schema is owned upstream by eval-toolkit and has evolved across versions:

manifest.v1.json — original NeurIPS-aligned schema (git_sha, code_versions, env, gpu_info, seeds, data_hashes, config_hash, leakage_report, versioned_objects, source_roles, guardrails, prediction_artifacts)
manifest.v2.json — extends v1 with additional fields (eval-toolkit v0.16+)
manifest.v3.json — current default as of eval-toolkit v0.23.0; adds required contamination_flags field (per-scorer three-state taxonomy), captured_at ISO-8601 timestamp, data_revisions (HuggingFace dataset + model commit SHAs), and metadata

This file documents the expected fields so a reviewer inspecting evals/<run>/manifest.json knows what’s there and why each field exists.

[LOCKED] Schema version: manifest.v3.json per ADR-057. v3 is the current eval-toolkit upstream default (v0.23.0+). Project consumes the upstream schema unchanged; no project-local extension fields.

Required fields

Field	Type	Purpose
`captured_at`	ISO 8601 timestamp	When the run completed
`git_sha`	string	Repo commit at run time
`config_hash`	string (sha256)	Hash of the trainer / eval config — ensures hyperparameter immutability invariant holds
`data_hashes`	map(str → sha256)	One entry per output artifact: `metrics_full.csv`, `predictions_full.parquet`, per-seed parquets, etc.
`data_revisions`	map(str → commit)	HuggingFace dataset + model commit SHAs at fetch time
`contamination_flags`	map(str → enum)	Per reference scorer, the three-state taxonomy verdict: `verified_disjoint` / `suspected_contamination` / `vendor_black_box`
`guardrails`	list(audit-record)	Outcomes of locked audit rules: leakage checks, schema validation, etc. Each record carries audit name + detail + location + n_pairs + severity
`code_versions`	map(str → version)	eval-toolkit version + any other load-bearing library versions
`env`	map(str → version)	Python, numpy, pandas, scipy, sklearn, torch, transformers, etc.
`gpu_info`	map(str → val)	`count`, `memory_gb`, `name`

Optional fields

Field	Type	Purpose
`cuda_version`	string	CUDA toolkit version at run time
`dirty_flag`	bool	True if the repo had uncommitted changes at capture (informational; canonical runs should always be clean)

Cross-references

Locked by: SPEC_GREENFIELD.md decision ledger §6 row “Per-row prediction persistence” (locked); §Tech-Stack rows “library version pins” (Phase 0 fills exact eval-toolkit version)
Verified by: Phase 4 evaluation outputs at evals/<run>/manifest.json; the tests/test_invariants.py:test_hyperparameter_immutability stub will check config_hash matches the committed config
Reviewer entry: docs/REPRODUCIBILITY.md references this schema for the fresh-clone recipe
Contamination taxonomy mapping: docs/THREAT_MODEL.md reference-scorer audit section; EVIDENCE.md §1-2 per-scorer verdicts

Why this schema

Reviewers will inspect manifests to verify reproducibility claims. Documenting the expected field list at seed time prevents guessing later. The schema is upstream-owned (eval-toolkit publishes the canonical JSON Schema); this project follows the upstream contract rather than inventing its own provenance format.