Manifest schema

Public-site note. This is a technical provenance reference for people inspecting run artifacts. It is not a required first read.

Every canonical evaluation run produces a manifest.json capturing provenance. The schema is owned upstream by eval-toolkit and has evolved across versions:

  • manifest.v1.json: original NeurIPS-aligned schema (git_sha, code_versions, env, gpu_info, seeds, data_hashes, config_hash, leakage_report, versioned_objects, source_roles, guardrails, prediction_artifacts)
  • manifest.v2.json: extends v1 with additional fields (eval-toolkit v0.16+)
  • manifest.v3.json: current default as of eval-toolkit v0.23.0; adds the required contamination_flags field (per-scorer three-state taxonomy), the captured_at ISO 8601 timestamp, data_revisions (HuggingFace dataset + model commit SHAs), and metadata

This file documents the expected fields so a reviewer inspecting evals/<run>/manifest.json knows what’s there and why each field exists.

[LOCKED] Schema version: manifest.v3.json per ADR-057. v3 is the current eval-toolkit upstream default (v0.23.0+). The project consumes the upstream schema unchanged, with no project-local extension fields.

Required fields

Field                Type                Purpose
captured_at          ISO 8601 timestamp  When the run completed
git_sha              string              Repo commit at run time
config_hash          string (sha256)     Hash of the trainer / eval config; ensures the hyperparameter immutability invariant holds
data_hashes          map(str → sha256)   One entry per output artifact: metrics_full.csv, predictions_full.parquet, per-seed parquets, etc.
data_revisions       map(str → commit)   HuggingFace dataset + model commit SHAs at fetch time
contamination_flags  map(str → enum)     Per reference scorer, the three-state taxonomy verdict: verified_disjoint / suspected_contamination / vendor_black_box
guardrails           list(audit-record)  Outcomes of locked audit rules: leakage checks, schema validation, etc. Each record carries audit name + detail + location + n_pairs + severity
code_versions        map(str → version)  eval-toolkit version + any other load-bearing library versions
env                  map(str → version)  Python, numpy, pandas, scipy, sklearn, torch, transformers, etc.
gpu_info             map(str → val)      count, memory_gb, name
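
A reviewer who wants a quick sanity check before reading a manifest in detail could script the table above. The following is a minimal sketch, not the upstream eval-toolkit validator: the function names check_manifest and check_manifest_file are hypothetical, and the check covers only field presence and the three contamination verdicts, not the full published JSON Schema.

```python
import json
from pathlib import Path

# Required top-level fields, copied from the "Required fields" table.
REQUIRED_FIELDS = (
    "captured_at", "git_sha", "config_hash", "data_hashes", "data_revisions",
    "contamination_flags", "guardrails", "code_versions", "env", "gpu_info",
)

# The three-state contamination taxonomy named in the table.
CONTAMINATION_STATES = (
    "verified_disjoint", "suspected_contamination", "vendor_black_box",
)


def check_manifest(manifest: dict) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = [
        f"missing required field: {name}"
        for name in REQUIRED_FIELDS
        if name not in manifest
    ]
    flags = manifest.get("contamination_flags", {})
    if isinstance(flags, dict):
        for scorer, verdict in flags.items():
            if verdict not in CONTAMINATION_STATES:
                problems.append(
                    f"contamination_flags[{scorer!r}] has unknown verdict {verdict!r}"
                )
    else:
        problems.append("contamination_flags is not a map")
    return problems


def check_manifest_file(path: str) -> list[str]:
    """Load a manifest.json from disk (hypothetical helper) and run the checks."""
    return check_manifest(json.loads(Path(path).read_text(encoding="utf-8")))
```

Calling check_manifest_file on a run's manifest.json and printing the returned list gives a short report. For authoritative validation, the JSON Schema published by eval-toolkit remains the reference.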

Optional fields

Field                Type                Purpose
cuda_version         string              CUDA toolkit version at run time
dirty_flag           bool                True if the repo had uncommitted changes at capture (informational; canonical runs should always be clean)
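
Because these fields are optional, code that reads a manifest has to tolerate their absence. The sketch below shows one way a reviewer script might surface them. The function name run_cleanliness_notes is hypothetical, and treating a missing dirty_flag as "unknown" rather than "clean" is an assumption of this sketch, not documented upstream behavior.

```python
def run_cleanliness_notes(manifest: dict) -> list[str]:
    """Collect informational notes about the optional manifest fields."""
    notes = []
    dirty = manifest.get("dirty_flag")
    if dirty is True:
        # Canonical runs should always be clean, so this is worth flagging.
        notes.append("dirty_flag is true: repo had uncommitted changes at capture")
    elif dirty is None:
        # ASSUMPTION: absence means "not recorded", not "clean".
        notes.append("dirty_flag not recorded: cleanliness unknown")
    if "cuda_version" not in manifest:
        notes.append("cuda_version not recorded")
    return notes
```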

Cross-references

  • Locked by: SPEC_GREENFIELD.md decision ledger §6 row “Per-row prediction persistence” (locked); §Tech-Stack rows “library version pins” (Phase 0 fills exact eval-toolkit version)
  • Verified by: Phase 4 evaluation outputs at evals/<run>/manifest.json; the tests/test_invariants.py:test_hyperparameter_immutability stub will check that config_hash matches the committed config
  • Reviewer entry: docs/REPRODUCIBILITY.md references this schema for the fresh-clone recipe
  • Contamination taxonomy mapping: docs/THREAT_MODEL.md reference-scorer audit section; EVIDENCE.md §1-2 per-scorer verdicts
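
The config_hash check mentioned under "Verified by" and the data_hashes entries can both be spot-checked by recomputing SHA-256 digests. The following is a minimal sketch under stated assumptions, not the contents of the test stub: it assumes each hash is the hex SHA-256 of a file's raw bytes and that data_hashes keys are file names relative to the run directory. Neither assumption is confirmed by this document, and all function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 of a file, read in chunks so large parquets fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def config_hash_matches(manifest: dict, config_path: Path) -> bool:
    # ASSUMPTION: config_hash is the hex SHA-256 of the config file's raw bytes.
    return manifest.get("config_hash") == sha256_file(config_path)


def mismatched_artifacts(manifest: dict, run_dir: Path) -> list[str]:
    # ASSUMPTION: data_hashes keys are file names relative to run_dir.
    bad = []
    for name, expected in manifest.get("data_hashes", {}).items():
        artifact = run_dir / name
        if not artifact.is_file() or sha256_file(artifact) != expected:
            bad.append(name)
    return sorted(bad)
```

If upstream hashes a canonicalized form of the config rather than its raw bytes, config_hash_matches would need to apply the same canonicalization before hashing.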

Why this schema

Reviewers will inspect manifests to verify reproducibility claims. Documenting the expected field list at seed time prevents guessing later. The schema is upstream-owned (eval-toolkit publishes the canonical JSON Schema); this project follows the upstream contract rather than inventing its own provenance format.