Reproducibility#
Background (skip if you’ve internalized this). A “reproducible” result is one that re-runs to bit-identical numbers given the same code, data, and configuration. ML reproducibility usually settles for statistical reproducibility: re-runs land inside the confidence interval (CI) you originally reported. The two fail in different ways: bit-identity fails on cross-architecture GPU runs and mixed precision; statistical reproducibility fails on undocumented seeds, mutated input data, and dependency drift. The NeurIPS Reproducibility Checklist formalizes the minimum information needed to re-run a result; this chapter walks through how eval-toolkit captures it.
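The two notions of reproducibility can be told apart mechanically. The sketch below is only an illustration: `classify_rerun` is a hypothetical helper (not part of eval-toolkit), and the numbers are made up for the example.

```python
def classify_rerun(original: float, rerun: float, ci_low: float, ci_high: float) -> str:
    """Label a re-run as 'bit-identical', 'statistical', or 'failed'."""
    if ci_low > ci_high:
        raise ValueError("ci_low must not exceed ci_high")
    if rerun == original:
        # Exact float equality is the bit-identity criterion.
        return "bit-identical"
    if ci_low <= rerun <= ci_high:
        # Inside the originally reported interval: statistically reproduced.
        return "statistical"
    return "failed"


# Illustrative numbers only, not results from a real run.
reported, ci = 0.9210, (0.9120, 0.9310)
labels = [classify_rerun(reported, r, *ci) for r in (0.9210, 0.9204, 0.9050)]
print(labels)
```

A re-run that lands outside the reported interval is the case worth investigating first: seeds, input data, and dependency versions are the usual suspects named above.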
eval-toolkit’s reproducibility primitives:
- set_global_seeds: seeds numpy, random, and (optional) torch in one call.
- provenance.file_sha256: hash any input artifact for the manifest.
- provenance.make_run_dir: per-run directory + git provenance.
- build_manifest + write_manifest: aggregates all of the above into one manifest.json per run, NeurIPS-aligned.
Setup#
import json
import tempfile
from pathlib import Path
from eval_toolkit import build_manifest, write_manifest, MANIFEST_SCHEMA_VERSION
A reproducible run, end-to-end#
The minimum-viable reproducible run captures: code version, git SHA, dirty-flag, seeds, data hashes, env, GPU info, wall-clock time, and any inline leakage report.
m = build_manifest(
    run_id="2026-05-08T15:00",
    config={"model": "deberta-lora", "k_folds": 5, "seed": 42},
    seeds={"global": 42, "bootstrap": 42},
    extra_code_versions={"my_app": "0.1.0"},
)
print(f"schema: {m.schema_version}")
print(f"git_sha: {m.git_sha[:10] if m.git_sha else '<not in git repo>'}")
print(f"dirty_flag: {m.dirty_flag}")
print(f"config_hash: {m.config_hash[:30]}...")
print(f"env keys: {sorted(m.env.keys())}")
Write it to a run directory next to results.json:
with tempfile.TemporaryDirectory() as d:
    path = write_manifest(m, d)
    loaded = json.loads(path.read_text())
    print(f"manifest fields: {sorted(loaded.keys())}")
Manifest fields ↔ NeurIPS Reproducibility Checklist#
The NeurIPS checklist demands ten artifacts. Eight map directly to manifest fields:
| Checklist item | Manifest field | Source |
|---|---|---|
| Code version | | |
| Working-tree state | | |
| Random seed | | caller-passed |
| Data version | | |
| Software env | | |
| Compute resources | | |
| Wall-clock time | | caller-passed (timed externally) |
| Eval config | | sha256 of canonical-JSON config |
The two checklist items the toolkit can’t capture mechanically:
- Hyperparameter search ranges. Belongs in your config; the manifest’s config_hash will catch any change to it.
- Number of training runs / variation across seeds. Captured by running evaluate_folded(... seeds=(1, 2, 3)) and reporting RunResult.fold_summary’s n_folds.
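The "sha256 of canonical-JSON config" idea behind config_hash can be sketched with the standard library. This assumes "canonical" means sorted keys and compact separators; eval-toolkit's actual canonicalization and output format may differ, and `canonical_config_hash` is a hypothetical name used only here.

```python
import hashlib
import json


def canonical_config_hash(config: dict) -> str:
    """Hash a config so that key order and whitespace do not affect the result."""
    # Assumption: canonical JSON = sorted keys + compact separators + ASCII escapes.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = canonical_config_hash({"model": "deberta-lora", "k_folds": 5, "seed": 42})
b = canonical_config_hash({"seed": 42, "k_folds": 5, "model": "deberta-lora"})
c = canonical_config_hash({"model": "deberta-lora", "k_folds": 10, "seed": 42})

print(a == b)  # same content, different key order
print(a == c)  # one value changed
```

This is why keeping hyperparameter search ranges inside the config is enough for change detection: any edit to the ranges changes the canonical string and therefore the hash.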
Croissant interoperability#
The Croissant
metadata format (MLCommons, 2024) is the de-facto standard for ML
dataset metadata, integrated with HuggingFace, Kaggle, and OpenML
covering 400 k+ datasets. eval-toolkit’s
DatasetLoader.describe() emits a
Croissant-compatible subset:
from eval_toolkit import DataFrameLoader
import pandas as pd
df = pd.DataFrame({"split": ["train", "test"], "text": ["a", "b"], "label": [0, 1]})
loader = DataFrameLoader(
df=df, split_col="split",
name="example",
cite_as="arXiv:0000.0000",
license="MIT",
url="https://example.com/dataset",
)
desc = loader.describe()
print(f"Croissant subset: {sorted(desc.keys())}")
The fields name / description / citeAs / license / url / distribution
match Croissant’s vocabulary. distribution carries
{name, contentUrl, sha256, contentSize} per file. Consumers who need
full Croissant production wrap eval-toolkit’s describe() output in
their own publishing pipeline.
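For a local file, one distribution record of the shape described above could be assembled as follows. This is a sketch under assumptions: `distribution_entry` is a hypothetical helper, the sha256 is emitted as a bare hex digest, and contentSize is formatted as a byte count with a " B" suffix; the exact formats eval-toolkit emits may differ.

```python
import hashlib
import tempfile
from pathlib import Path


def distribution_entry(path: Path, content_url: str) -> dict:
    """Build one {name, contentUrl, sha256, contentSize} record for a local file."""
    data = path.read_bytes()  # fine for small files; stream large ones instead
    return {
        "name": path.name,
        "contentUrl": content_url,
        "sha256": hashlib.sha256(data).hexdigest(),
        "contentSize": f"{len(data)} B",  # assumed format, see lead-in
    }


with tempfile.TemporaryDirectory() as d:
    shard = Path(d) / "train-00000.parquet"
    shard.write_bytes(b"hello")  # stand-in bytes, not a real parquet file
    entry = distribution_entry(shard, "https://example.com/dataset/train-00000.parquet")

print(sorted(entry.keys()))
print(entry["name"], entry["contentSize"])
```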
End-to-end verification against HF Hub (v0.41.0)#
HFDatasetsLoader.describe() populates per-file sha256 from HF Hub’s
authoritative source-of-truth. The verification is exercised in
tests/test_croissant_e2e.py (marker @pytest.mark.integration).
Dual-source design. HFDatasetsLoader.describe() fetches from two HF Hub endpoints:
- Croissant (/api/datasets/{repo}/croissant): for the metadata vocabulary (name, description, license, citation, schema).
- Tree API (/api/datasets/{repo}/tree/refs%2Fconvert%2Fparquet): for per-file sha256 (read from each file’s lfs.oid, which equals sha256sum of the raw bytes).
Why dual sources? HF Hub’s Croissant emitter currently fills
distribution[].sha256 with a placeholder URL pointing at MLCommons
Croissant spec issue #80. Once the emitter populates sha256 with real
values (which will equal the existing lfs.oid), HFDatasetsLoader
switches sources in ~5 LOC, with no contract change for callers.
What the integration test verifies:
- describe()['distribution'][i]['sha256'] returns a real sha256:<64-hex> for each parquet shard.
- Downloading the shard from contentUrl and hashing the bytes produces the same value (bit-exact verification against the live HF Hub for stanfordnlp/sst2).
- Caller-provided overrides (name=, cite_as=) win over Croissant fetches.
- fetch_remote_metadata=False preserves pre-v0.41 behavior (no network).
This satisfies v1.0 readiness Gate 4 in spirit (the file we evaluated matches the source’s authoritative hash) and in literal form (when Croissant #80 lands; one-line migration).
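The byte-level check can be sketched offline. The code below assumes the tree listing is a JSON array of entries with `type`, `path`, and an optional `lfs.oid`; that shape, the helper names, and the sample listing are illustrative assumptions, not a transcript of the live API or of the integration test.

```python
import hashlib


def sha256_by_path(tree_entries: list) -> dict:
    """Map file path -> 'sha256:<hex>' for every LFS-tracked file in a listing."""
    hashes = {}
    for entry in tree_entries:
        lfs = entry.get("lfs")
        if entry.get("type") == "file" and lfs and "oid" in lfs:
            hashes[entry["path"]] = "sha256:" + lfs["oid"]
    return hashes


def matches(data: bytes, expected: str) -> bool:
    """Compare downloaded bytes against a 'sha256:<hex>' value."""
    return "sha256:" + hashlib.sha256(data).hexdigest() == expected


# Hand-built listing with stand-in bytes; a real run would fetch both over HTTP.
shard_bytes = b"stand-in shard bytes"
listing = [
    {"type": "directory", "path": "default"},
    {
        "type": "file",
        "path": "default/train/0000.parquet",
        "lfs": {"oid": hashlib.sha256(shard_bytes).hexdigest(), "size": len(shard_bytes)},
    },
]

expected = sha256_by_path(listing)
ok = matches(shard_bytes, expected["default/train/0000.parquet"])
tampered = matches(b"tampered bytes", expected["default/train/0000.parquet"])
print(ok, tampered)
```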
PyTorch determinism — the sharp edges#
Bitwise reproducibility on GPU is harder than on CPU. The PyTorch 2.8 reproducibility notes document four sharp edges that every PyTorch eval pipeline silently hits unless explicitly addressed.
1. DataLoader worker seeding#
PyTorch DataLoader workers seed themselves independently of your
global seed unless you pass worker_init_fn= AND generator=. Default
behavior is silently non-deterministic.
# Sketch: requires torch, marked skip for Sybil.
import torch # noqa
from torch.utils.data import DataLoader # noqa
def seed_worker(worker_id):
    import random
    import numpy as np
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)
g = torch.Generator()
g.manual_seed(42)
# loader = DataLoader(dataset, batch_size=32, shuffle=True,
#                     worker_init_fn=seed_worker, generator=g)
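The modulo in seed_worker is there because torch.initial_seed() returns a 64-bit integer, while np.random.seed only accepts values below 2**32. Inside a worker, PyTorch documents the seed as base_seed + worker_id. The stdlib-only sketch below mimics that arithmetic so it runs without torch; `derive_worker_seed` and `worker_sample` are hypothetical names, and `random.Random` stands in for the per-worker RNGs.

```python
import random


def derive_worker_seed(base_seed: int, worker_id: int) -> int:
    """Mimic seed_worker's arithmetic: per-worker seed folded into 32 bits."""
    return (base_seed + worker_id) % 2**32


def worker_sample(base_seed: int, worker_id: int, n: int = 3) -> list:
    """Draw n integers from an RNG seeded the way a worker would be."""
    rng = random.Random(derive_worker_seed(base_seed, worker_id))
    return [rng.randint(0, 99) for _ in range(n)]


base_seed = 2**40 + 42  # stand-in for the 64-bit base seed drawn from the generator
run_a = [worker_sample(base_seed, w) for w in range(4)]
run_b = [worker_sample(base_seed, w) for w in range(4)]
worker_seeds = [derive_worker_seed(base_seed, w) for w in range(4)]

print(run_a == run_b)   # same base seed, same per-worker streams
print(worker_seeds)     # distinct seed per worker, all below 2**32
```

With a fixed base seed every worker gets a distinct but repeatable stream, which is the property the worker_init_fn plus generator pairing is meant to provide.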
2. CUBLAS_WORKSPACE_CONFIG must be set in the environment, BEFORE CUDA init#
This is the trap. cuBLAS allocates its workspace once per CUDA context;
once the context exists, setting the env var has no effect. Set it
in your shell, or in Python before anything imports torch or otherwise initializes CUDA.
# Sketch — must run BEFORE any torch.cuda usage.
import os # noqa
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
# import torch # ← only AFTER setting the env var
The recommended setting is :4096:8. The variant :16:8 reduces
memory but degrades throughput; only use it if you’re memory-bound.
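One way to make the ordering mistake visible is a small guard that runs at the top of the entry point. This is a sketch: `ensure_cublas_workspace_config` is a hypothetical helper, and checking `sys.modules` only detects that torch was imported, not that a CUDA context already exists, so it warns rather than fails.

```python
import os
import sys
import warnings


def ensure_cublas_workspace_config(value: str = ":4096:8") -> str:
    """Set CUBLAS_WORKSPACE_CONFIG if unset and return the effective value."""
    if "torch" in sys.modules:
        # Heuristic only: an imported torch may or may not have initialized CUDA.
        warnings.warn(
            "torch is already imported; CUBLAS_WORKSPACE_CONFIG may be read too late",
            stacklevel=2,
        )
    # setdefault keeps a value exported in the shell and fills in the default otherwise.
    return os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", value)


effective = ensure_cublas_workspace_config()
print(effective)
```

Using setdefault means a value exported in the shell wins over the in-code default, which keeps the memory-bound `:16:8` variant available without a code change.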
3. torch.use_deterministic_algorithms(True, warn_only=True)#
Some PyTorch ops have no deterministic kernel. True (without
warn_only) raises on the first one — not graceful for production.
warn_only=True falls back to non-deterministic for those ops with a
warning, while keeping everything else deterministic.
# import torch # noqa
# torch.use_deterministic_algorithms(True, warn_only=True)
# torch.backends.cudnn.deterministic = True
# torch.backends.cudnn.benchmark = False # disables autotuner; deterministic but slower
4. Mixed precision is NOT bitwise reproducible across GPU architectures#
Even with all flags set, bf16 / fp16 inference produces different logits on V100 vs A100 vs H100. This is the precision-vs-determinism trade-off and there’s no fix. Two implications:
- Bootstrap CIs absorb the noise. A 5e-4 logit difference produces a metric difference well below the BCa CI width. “Statistical reproducibility” still holds; bit-identity does not.
- Calibrate at inference precision. Don’t fit temperature on fp32 and deploy in bf16; the calibration drifts. See calibration.md § “PyTorch & transformer specifics”.
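One well-known contributor to this kind of drift is that floating-point addition is not associative, so kernels that reduce in a different order can produce different bits. The source does not name a mechanism; the sketch below only illustrates the general effect in plain Python floats and shows why a tolerance comparison still passes.

```python
import math

# Same three numbers, two summation orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

bit_identical = left_to_right == right_to_left
close_enough = math.isclose(left_to_right, right_to_left, rel_tol=1e-9)

print(bit_identical)                       # exact comparison
print(close_enough)                        # tolerance comparison
print(abs(left_to_right - right_to_left))  # size of the discrepancy
```

The same reasoning motivates comparing metrics against a CI width instead of comparing raw logits for equality.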
Replay recipe#
To re-run a result from its manifest:
1. Check git_sha and dirty_flag. Hard-fail if dirty_flag=True; the original run wasn’t reproducible to start with.
2. git checkout <git_sha>.
3. Recreate the env: pip install eval-toolkit==<code_versions['eval_toolkit']> plus pinned versions of any other libraries listed in code_versions / env.
4. Verify input data: hash every file with provenance.file_sha256 and confirm it matches data_hashes.
5. Set seeds via set_global_seeds(manifest['seeds']['global']).
6. Re-run the eval entry point. The output manifest.json’s config_hash should match the original.
If you’re on a different GPU architecture than the original (CUDA major
version mismatch in cuda_version), expect statistical-but-not-bit
reproducibility — metrics within the BCa CI width.
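The dirty-flag and data-hash checks from the recipe can be sketched as a preflight function. This assumes the manifest is a plain dict in which data_hashes maps relative paths to bare hex digests; that layout, `preflight`, and `sha256_of_file` (a stand-in for provenance.file_sha256) are assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large inputs do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def preflight(manifest: dict, data_root: Path) -> list:
    """Return the problems that should block a replay; an empty list means go."""
    problems = []
    if manifest.get("dirty_flag"):
        problems.append("dirty_flag is True: original tree had uncommitted changes")
    if not manifest.get("git_sha"):
        problems.append("git_sha missing: cannot check out the original code")
    for rel_path, expected in manifest.get("data_hashes", {}).items():
        target = data_root / rel_path
        if not target.is_file():
            problems.append(f"{rel_path}: file not found")
        elif sha256_of_file(target) != expected:
            problems.append(f"{rel_path}: sha256 mismatch")
    return problems


with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    (root / "test.csv").write_bytes(b"text,label\na,0\n")
    manifest = {
        "git_sha": "0" * 40,  # placeholder, not a real commit
        "dirty_flag": False,
        "data_hashes": {"test.csv": sha256_of_file(root / "test.csv")},
    }
    clean = preflight(manifest, root)
    (root / "test.csv").write_bytes(b"text,label\na,1\n")  # simulate mutated input
    mutated = preflight(manifest, root)

print(clean)
print(mutated)
```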
Pitfalls / Common mistakes#
- Setting seed AFTER imports. Some libraries (e.g., transformers, tokenizers) seed their RNG at import time. Call set_global_seeds first, before any other ML library imports.
- Forgetting the dataloader workers. A perfectly seeded model still shuffles data non-deterministically without worker_init_fn + generator=.
- Trusting dirty_flag=False to mean “perfectly reproducible”. It means “no uncommitted changes”; submodule state, lockfile drift, and OS package versions are not in git. Combine with pyproject.toml / uv.lock pinning and env snapshot.
- Hashing data after preprocessing. Hash the raw inputs, not the output of your preprocessing pipeline. Otherwise a preprocessing bug is invisible to downstream replay.
- Comparing manifests across CUDA versions. Different cuda_version → expect logit-level differences below CI width. Compare metrics, not raw scores.
- Logging seeds verbatim into a public artifact. Seeds are usually fine to share, but if your eval involves sensitive synthetic data generation, the seed reveals the data.
Putting it all together#
A full reproducible-run skeleton:
from eval_toolkit import set_global_seeds
# 1. Seeds first (before any heavy imports).
set_global_seeds(42)
# 2. Build manifest with seeds + config + data hashes.
m = build_manifest(
run_id="reproducibility-demo",
config={"k_folds": 5, "splitter": "StratifiedKFoldSplitter"},
seeds={"global": 42, "bootstrap": 42},
extra_code_versions={"my_app": "0.1.0"},
wall_clock_seconds=12.3, # measured externally
)
# 3. Emit alongside results.
with tempfile.TemporaryDirectory() as d:
    manifest_path = write_manifest(m, d)
    # results.json / results_full.json are written separately by
    # eval_toolkit.harness.write_run_result(...)
print(f"manifest written: {manifest_path.name}")
print(f" schema_version: {m.schema_version}")
print(f" recorded fields: {len(m.to_dict())}")
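The skeleton passes wall_clock_seconds as a value "measured externally". One possible way to take that measurement is a small context manager around the eval entry point; `stopwatch` is a hypothetical helper and the summation is only a stand-in workload.

```python
import time
from contextlib import contextmanager


@contextmanager
def stopwatch(result: dict):
    """Store elapsed wall-clock seconds under result['wall_clock_seconds']."""
    start = time.perf_counter()
    try:
        yield result
    finally:
        # Recorded even if the timed block raises, so failed runs are still timed.
        result["wall_clock_seconds"] = time.perf_counter() - start


timing = {}
with stopwatch(timing):
    total = sum(i * i for i in range(100_000))  # stand-in for the eval entry point

print(f"elapsed: {timing['wall_clock_seconds']:.4f}s")
```

The recorded value could then be forwarded as the wall_clock_seconds argument shown in the skeleton.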
Further reading#
NeurIPS Paper Checklist. https://neurips.cc/public/guides/PaperChecklist
PyTorch 2.8 reproducibility notes. https://docs.pytorch.org/docs/stable/notes/randomness.html
Croissant: A Metadata Format for ML-Ready Datasets. MLCommons, 2024. arXiv:2403.19546.
MLCommons Croissant spec. https://docs.mlcommons.org/croissant/
Pineau, J. et al. Improving reproducibility in machine learning research. JMLR 22, 2021.
See also: comparison.md (CIs absorb sub-CI-width noise), testing.md (golden tests for bit-identical reproducibility on CPU).