Reproducibility#

Background (skip if you’ve internalized this). A “reproducible” result, in the strict sense, is one that re-runs to bit-identical numbers given the same code, data, and configuration. ML reproducibility usually settles for statistical reproducibility: re-runs land inside the confidence interval (CI) you originally reported. The two fail in different ways: bit-identity breaks across GPU architectures and under mixed precision, while statistical reproducibility breaks on undocumented seeds, mutated input data, and dependency drift. The NeurIPS Reproducibility Checklist formalizes the minimum information needed to re-run a result; this chapter walks through how eval-toolkit captures it.

eval-toolkit’s reproducibility primitives:

set_global_seeds: seeds numpy, random, and (optionally) torch in one call.
provenance.file_sha256: hashes any input artifact for the manifest.
provenance.capture_git_sha: records the current git commit SHA.
provenance.make_run_dir: creates a per-run directory with git provenance.
build_manifest + write_manifest: aggregate all of the above into one NeurIPS-aligned manifest.json per run.
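
For intuition, a one-call seeding helper could look roughly like the sketch below. It is not eval-toolkit's implementation: `seed_everything` is a hypothetical name, and treating numpy and torch as optional imports is an assumption about how "optional" seeding might be handled.

```python
import random


def seed_everything(seed: int) -> dict:
    """Seed every RNG that is importable and report which ones were seeded.

    Hypothetical stand-in for illustration, not eval-toolkit's code.
    """
    seeded = {"random": True, "numpy": False, "torch": False}
    random.seed(seed)
    try:
        import numpy as np

        np.random.seed(seed)
        seeded["numpy"] = True
    except ImportError:
        pass  # NumPy not installed: nothing to seed
    try:
        import torch

        torch.manual_seed(seed)
        seeded["torch"] = True
    except ImportError:
        pass  # torch not installed: nothing to seed
    return seeded


# Re-seeding with the same value replays the same random stream.
seed_everything(42)
first = [random.random() for _ in range(3)]
seed_everything(42)
second = [random.random() for _ in range(3)]
print(first == second)
```

Returning the dictionary of seeded libraries makes it easy to record in a run log which RNGs were actually under control.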

Setup#

import json
import tempfile
from pathlib import Path
from eval_toolkit import build_manifest, write_manifest, MANIFEST_SCHEMA_VERSION

A reproducible run, end-to-end#

The minimum-viable reproducible run captures: code version, git SHA, dirty-flag, seeds, data hashes, env, GPU info, wall-clock time, and any inline leakage report.

m = build_manifest(
    run_id="2026-05-08T15:00",
    config={"model": "deberta-lora", "k_folds": 5, "seed": 42},
    seeds={"global": 42, "bootstrap": 42},
    extra_code_versions={"my_app": "0.1.0"},
)
print(f"schema: {m.schema_version}")
print(f"git_sha: {m.git_sha[:10] if m.git_sha else '<not in git repo>'}")
print(f"dirty_flag: {m.dirty_flag}")
print(f"config_hash: {m.config_hash[:30]}...")
print(f"env keys: {sorted(m.env.keys())}")

Write it to a run directory next to results.json:

with tempfile.TemporaryDirectory() as d:
    path = write_manifest(m, d)
    loaded = json.loads(path.read_text())
    print(f"manifest fields: {sorted(loaded.keys())}")

Manifest fields ↔ NeurIPS Reproducibility Checklist#

The NeurIPS checklist demands ten artifacts. Eight map directly to manifest fields:

Checklist item	Manifest field	Source
Code version	`git_sha`, `code_versions`	`provenance.capture_git_sha`, package metadata
Working-tree state	`dirty_flag`	`git status --porcelain`
Random seed	`seeds`	caller-passed
Data version	`data_hashes`	`provenance.file_sha256` (sha256 of every input)
Software env	`env`	`sys.version`, `platform.platform()`, importable lib `__version__`s
Compute resources	`gpu_info`, `cuda_version`	`nvidia-smi --query-gpu` (graceful fallback)
Wall-clock time	`wall_clock_seconds`	caller-passed (timed externally)
Eval config	`config_hash`	sha256 of canonical-JSON config
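
To make the "sha256 of canonical-JSON config" row concrete, here is a minimal sketch of one way to canonicalize and hash a config. The function name `canonical_config_hash`, the exact JSON settings, and the `sha256:` prefix are assumptions for illustration; eval-toolkit's own canonicalization may differ.

```python
import hashlib
import json


def canonical_config_hash(config: dict) -> str:
    """Hash a config via a canonical JSON encoding (sorted keys, no extra whitespace)."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = canonical_config_hash({"model": "deberta-lora", "k_folds": 5, "seed": 42})
b = canonical_config_hash({"seed": 42, "k_folds": 5, "model": "deberta-lora"})
c = canonical_config_hash({"seed": 43, "k_folds": 5, "model": "deberta-lora"})

# Key order does not matter; any value change does.
print(a == b, a == c)
```

Sorting keys is what makes the hash independent of dictionary construction order, so two configs hash equal exactly when their contents are equal.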

The two checklist items the toolkit can’t capture mechanically:

Hyperparameter search ranges. These belong in your config; the manifest’s config_hash will change whenever they do.
Number of training runs / variation across seeds. Record this by running evaluate_folded(..., seeds=(1, 2, 3)) and reporting n_folds from RunResult.fold_summary.
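
As a rough illustration of reporting variation across seeds, the sketch below summarizes a metric over three seeds. `toy_eval` is a hypothetical stand-in for a real evaluation run (it is not evaluate_folded, whose full signature is not shown here), and its scores are synthetic.

```python
import random
import statistics


def toy_eval(seed: int) -> float:
    """Hypothetical stand-in for one evaluation run; returns a noisy accuracy."""
    rng = random.Random(seed)
    return 0.90 + rng.gauss(0.0, 0.01)


seeds = (1, 2, 3)
scores = [toy_eval(s) for s in seeds]
summary = {
    "n_runs": len(scores),
    "mean": statistics.mean(scores),
    "stdev": statistics.stdev(scores),  # sample standard deviation across seeds
}
print(summary)
```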

Croissant interoperability#

The Croissant metadata format (MLCommons, 2024) is the de facto standard for ML dataset metadata. It is integrated with HuggingFace, Kaggle, and OpenML, covering 400k+ datasets. eval-toolkit’s DatasetLoader.describe() emits a Croissant-compatible subset:

from eval_toolkit import DataFrameLoader
import pandas as pd

df = pd.DataFrame({"split": ["train", "test"], "text": ["a", "b"], "label": [0, 1]})
loader = DataFrameLoader(
    df=df, split_col="split",
    name="example",
    cite_as="arXiv:0000.0000",
    license="MIT",
    url="https://example.com/dataset",
)
desc = loader.describe()
print(f"Croissant subset: {sorted(desc.keys())}")

The fields name / description / citeAs / license / url / distribution match Croissant’s vocabulary. distribution carries {name, contentUrl, sha256, contentSize} per file. Consumers who need a complete Croissant document can wrap eval-toolkit’s describe() output in their own publishing pipeline.
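
A per-file distribution record with the four keys listed above could be assembled along these lines. This is a sketch under assumptions: `distribution_entry` is a hypothetical helper, and the `sha256:` prefix and a byte-count contentSize are illustrative choices rather than confirmed eval-toolkit output.

```python
import hashlib
import tempfile
from pathlib import Path


def distribution_entry(path: Path, content_url: str) -> dict:
    """Build one distribution record for a local file (illustrative only)."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Stream in 1 MiB chunks so large shards never load fully into memory.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return {
        "name": path.name,
        "contentUrl": content_url,
        "sha256": "sha256:" + digest.hexdigest(),
        "contentSize": path.stat().st_size,  # size in bytes (assumed unit)
    }


with tempfile.TemporaryDirectory() as d:
    shard = Path(d) / "train.parquet"
    shard.write_bytes(b"not a real parquet file")
    entry = distribution_entry(shard, "https://example.com/dataset/train.parquet")
    print(sorted(entry))
```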

End-to-end verification against HF Hub (v0.41.0)#

HFDatasetsLoader.describe() populates per-file sha256 from HF Hub’s authoritative source-of-truth. The verification is exercised in tests/test_croissant_e2e.py (marker @pytest.mark.integration).

Dual-source design — HFDatasetsLoader.describe() fetches from two HF Hub endpoints:

Croissant (/api/datasets/{repo}/croissant) — for the metadata vocabulary (name, description, license, citation, schema).
Tree API (/api/datasets/{repo}/tree/refs%2Fconvert%2Fparquet) — for per-file sha256 (read from each file’s lfs.oid, which equals sha256sum of the raw bytes).
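
The lfs.oid lookup can be sketched offline. The payload below is made up for illustration, and its shape (entries with type, path, and an lfs object holding oid) is an assumption based on the description above; `sha256_by_path` is a hypothetical helper, not part of HFDatasetsLoader.

```python
import hashlib


def sha256_by_path(tree_entries: list) -> dict:
    """Map file path -> 'sha256:<hex>' for entries that carry an LFS pointer."""
    hashes = {}
    for entry in tree_entries:
        lfs = entry.get("lfs")
        if entry.get("type") == "file" and lfs and "oid" in lfs:
            hashes[entry["path"]] = "sha256:" + lfs["oid"]
    return hashes


# Made-up payload; the oid is computed here so that it is a valid digest.
fake_bytes = b"pretend parquet shard"
sample_tree = [
    {"type": "directory", "path": "default/train"},
    {
        "type": "file",
        "path": "default/train/0000.parquet",
        "size": len(fake_bytes),
        "lfs": {"oid": hashlib.sha256(fake_bytes).hexdigest(), "size": len(fake_bytes)},
    },
]
print(sha256_by_path(sample_tree))
```

Directory entries and files without an LFS pointer are skipped, so the result only contains files whose hash can be checked against downloaded bytes.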

Why dual sources? HF Hub’s Croissant emitter currently fills distribution[].sha256 with a placeholder URL pointing at MLCommons Croissant spec issue #80 (“In , check SHA256 or MD5”) which is open. The Croissant spec itself doesn’t yet require per-file checksums from emitters, and HF Hub is honest about it — they punt the field rather than fabricate a hash. The authoritative hash IS available, just via the tree API. When MLCommons #80 resolves and HF Hub starts populating Croissant sha256 with real values (which will equal the existing lfs.oid), HFDatasetsLoader switches sources in ~5 LOC — no contract change for callers.

What the integration test verifies:

describe()['distribution'][i]['sha256'] returns a real sha256:<64-hex> for each parquet shard.
Downloading the shard from contentUrl and hashing the bytes produces the same value (bit-exact verification against the live HF Hub for stanfordnlp/sst2).
Caller-provided overrides (name=, cite_as=) win over Croissant fetches.
fetch_remote_metadata=False preserves pre-v0.41 behavior (no network).

This satisfies v1.0 readiness Gate 4 in spirit: the file we evaluated matches the source’s authoritative hash. It will satisfy the gate literally once Croissant #80 lands, after the small (~5 LOC) source switch described above.

PyTorch determinism — the sharp edges#

Bitwise reproducibility on GPU is harder than on CPU. The PyTorch 2.8 reproducibility notes document four sharp edges that every PyTorch eval pipeline silently hits unless explicitly addressed.

1. DataLoader worker seeding#

PyTorch DataLoader workers seed themselves independently of your global seed unless you pass worker_init_fn= AND generator=. Default behavior is silently non-deterministic.

# Sketch — requires torch, marked skip for Sybil.
import torch  # noqa
from torch.utils.data import DataLoader  # noqa

def seed_worker(worker_id):
    import random
    import numpy as np
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

g = torch.Generator()
g.manual_seed(42)

# loader = DataLoader(dataset, batch_size=32, shuffle=True,
#                     worker_init_fn=seed_worker, generator=g)

2. `CUBLAS_WORKSPACE_CONFIG` must be set in the environment, BEFORE CUDA init#

This is the trap. cuBLAS allocates its workspace once per CUDA context; once the context exists, setting the environment variable has no effect. Set it in your shell, or in Python before the first call that initializes CUDA.

# Sketch — must run BEFORE any torch.cuda usage.
import os  # noqa
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
# import torch  # ← only AFTER setting the env var

The recommended setting is :4096:8. The variant :16:8 reduces memory but degrades throughput; only use it if you’re memory-bound.
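
One way to catch the ordering mistake early is a small guard that refuses to set the variable once CUDA is already up. This is a sketch: `configure_cublas_workspace` is a hypothetical helper, and using torch.cuda.is_initialized() as the "too late" signal is an assumption about what is good enough in practice.

```python
import os
import sys


def configure_cublas_workspace(value: str = ":4096:8") -> str:
    """Set CUBLAS_WORKSPACE_CONFIG unless CUDA has already been initialized."""
    torch = sys.modules.get("torch")  # only inspect torch if it was already imported
    if torch is not None and torch.cuda.is_initialized():
        raise RuntimeError(
            "CUDA is already initialized; set CUBLAS_WORKSPACE_CONFIG earlier"
        )
    # setdefault keeps any value that was exported in the shell.
    return os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", value)


print(configure_cublas_workspace())
```

Because the guard reads sys.modules instead of importing torch, calling it at the very top of an entry point does not itself trigger any heavy import.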

3. `torch.use_deterministic_algorithms(True, warn_only=True)`#

Some PyTorch ops have no deterministic kernel. With True alone (no warn_only), PyTorch raises a RuntimeError on the first such op, which is not graceful in production. With warn_only=True, those ops fall back to their non-deterministic implementation and emit a warning, while everything else stays deterministic.

# import torch  # noqa
# torch.use_deterministic_algorithms(True, warn_only=True)
# torch.backends.cudnn.deterministic = True
# torch.backends.cudnn.benchmark = False  # disables autotuner; deterministic but slower

4. Mixed precision is NOT bitwise reproducible across GPU architectures#

Even with all flags set, bf16 / fp16 inference produces different logits on V100 vs A100 vs H100. This is the precision-vs-determinism trade-off and there’s no fix. Two implications:

Bootstrap CIs absorb the noise. A 5e-4 logit difference produces a metric difference well below the BCa CI width. “Statistical reproducibility” still holds; bit-identity does not.
Calibrate at inference precision. Don’t fit temperature on fp32 and deploy in bf16; the calibration drifts. See calibration.md §“PyTorch & transformer specifics”.
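
The first implication can be checked on synthetic data. The sketch below perturbs logits by 5e-4 and compares the resulting accuracy change with a bootstrap interval. It uses a plain percentile bootstrap rather than BCa to stay short, and all data here is generated, not measured.

```python
import random

rng = random.Random(0)
n = 400

# Synthetic logits and noisy labels (illustrative data, not real results).
logits = [rng.gauss(0.0, 1.0) for _ in range(n)]
labels = [1 if x + rng.gauss(0.0, 0.5) > 0 else 0 for x in logits]

# Second "run": the same logits shifted by +/- 5e-4, mimicking precision noise.
shifted = [x + rng.choice((-5e-4, 5e-4)) for x in logits]


def accuracy(scores, targets):
    """Fraction of examples where the sign of the logit matches the label."""
    return sum((s > 0) == bool(t) for s, t in zip(scores, targets)) / len(targets)


def percentile_ci(scores, targets, n_boot=1000, alpha=0.05):
    """Plain percentile bootstrap over examples (simpler than BCa)."""
    correct = [(s > 0) == bool(t) for s, t in zip(scores, targets)]
    stats = sorted(
        sum(rng.choices(correct, k=len(correct))) / len(correct)
        for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


lo, hi = percentile_ci(logits, labels)
delta = abs(accuracy(logits, labels) - accuracy(shifted, labels))
print(f"metric delta={delta:.4f}, CI width={hi - lo:.4f}")
```

Only logits within 5e-4 of the decision threshold can flip, so the metric moves by at most a few examples while the interval reflects sampling noise over all of them.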

Replay recipe#

To re-run a result from its manifest:

1. Check git_sha and dirty_flag. Hard-fail if dirty_flag=True: the original run wasn’t reproducible to start with.
2. Run git checkout <git_sha>.
3. Recreate the env: pip install eval-toolkit==<code_versions['eval_toolkit']> plus pinned versions of any other libraries listed in code_versions / env.
4. Verify input data: hash every file with provenance.file_sha256 and confirm it matches data_hashes.
5. Set seeds via set_global_seeds(manifest['seeds']['global']).
6. Re-run the eval entry point. The output manifest.json’s config_hash should match the original.
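
The dirty-flag and data-hash checks from the recipe can be sketched as a single pre-flight function. This is an illustration under assumptions: `precheck` and `file_digest` are hypothetical helpers (standing in for provenance.file_sha256), and data_hashes is assumed to map relative paths to `sha256:<hex>` strings.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path: Path) -> str:
    """Hash a file's raw bytes; hypothetical stand-in for provenance.file_sha256."""
    return "sha256:" + hashlib.sha256(path.read_bytes()).hexdigest()


def precheck(manifest: dict, data_root: Path) -> list:
    """Return a list of problems; an empty list means replay may proceed."""
    problems = []
    if manifest.get("dirty_flag"):
        problems.append("original run had uncommitted changes")
    for rel_path, expected in manifest.get("data_hashes", {}).items():
        target = data_root / rel_path
        if not target.exists():
            problems.append(f"missing input: {rel_path}")
        elif file_digest(target) != expected:
            problems.append(f"hash mismatch: {rel_path}")
    return problems


with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    (root / "test.csv").write_text("text,label\na,0\n")
    manifest = {
        "dirty_flag": False,
        "data_hashes": {"test.csv": file_digest(root / "test.csv")},
    }
    clean = precheck(manifest, root)
    (root / "test.csv").write_text("text,label\na,1\n")  # simulate mutated data
    tampered = precheck(manifest, root)

print(clean, tampered)
```

Collecting every problem before failing gives a complete picture in one pass, which is handy when several inputs have drifted at once.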

If you’re on a different GPU architecture than the original (compare gpu_info), or the CUDA major version in cuda_version differs, expect statistical but not bit-level reproducibility: metrics should fall within the BCa CI width.

Pitfalls / Common mistakes#

Setting seed AFTER imports. Some libraries (e.g., transformers, tokenizers) seed their RNG at import time. Call set_global_seeds first, before any other ML library imports.
Forgetting the dataloader workers. A perfectly seeded model still shuffles data non-deterministically without worker_init_fn + generator=.
Trusting dirty_flag=False to mean “perfectly reproducible”. It means “no uncommitted changes” — submodule state, lockfile drift, OS package versions are not in git. Combine with pyproject.toml / uv.lock pinning and env snapshot.
Hashing data after preprocessing. Hash the raw inputs, not the output of your preprocessing pipeline. Otherwise a preprocessing bug is invisible to downstream replay.
Comparing manifests across CUDA versions. A different cuda_version means logit-level differences are expected; the resulting metric differences should stay below the CI width. Compare metrics, not raw scores.
Logging seeds verbatim into a public artifact. Seeds are usually fine to share, but if your eval involves sensitive synthetic data generation, the seed reveals the data.

Putting it all together#

A full reproducible-run skeleton:

import tempfile

from eval_toolkit import build_manifest, set_global_seeds, write_manifest

# 1. Seeds first (before any heavy imports).
set_global_seeds(42)

# 2. Build manifest with seeds + config + data hashes.
m = build_manifest(
    run_id="reproducibility-demo",
    config={"k_folds": 5, "splitter": "StratifiedKFoldSplitter"},
    seeds={"global": 42, "bootstrap": 42},
    extra_code_versions={"my_app": "0.1.0"},
    wall_clock_seconds=12.3,  # measured externally
)

# 3. Emit alongside results.
with tempfile.TemporaryDirectory() as d:
    manifest_path = write_manifest(m, d)
    # results.json / results_full.json are written separately by
    # eval_toolkit.harness.write_run_result(...)
    print(f"manifest written: {manifest_path.name}")
    print(f"  schema_version: {m.schema_version}")
    print(f"  recorded fields: {len(m.to_dict())}")