# Versioning Tier-2 implementations

> **Background** *(skip if you've internalized this)*. Cross-version
> metric comparisons are silently misleading. If a `Scorer`'s scoring
> recipe changed between two runs (different model checkpoint, bumped
> LoRA rank, new prompt template), reporting "PR-AUC went up by 0.04"
> is meaningless because the metric isn't measuring the same thing.
> `lm-evaluation-harness` solved this by stamping every task with a
> `VERSION` field, surfaced in every output. eval-toolkit adopts the
> same pattern via the
> [`Versioned`](../api/leakage.md) opt-in Protocol.

This chapter covers when to expose a `version` attribute on your
Tier-2 implementations (Scorer, LeakageCheck, Splitter,
ThresholdSelector, DatasetLoader) and how it threads into the
`RunManifest`.

## Setup

```python
from eval_toolkit import Versioned, build_manifest, write_manifest
import tempfile
```

(versioned-protocol)=
## The Versioned Protocol

The toolkit ships [`Versioned`](../api/leakage.md) as a one-attribute
runtime-checkable Protocol:

```python
# from eval_toolkit.leakage import Versioned
# @runtime_checkable
# class Versioned(Protocol):
#     @property
#     def version(self) -> str: ...
```

**Opt-in**: implementations are not required to expose `version`. If
present, `build_manifest(versioned=...)` auto-collects it into
`RunManifest.versioned_objects`; if absent, the object is silently
skipped (per `_collect_versioned` in
[`manifest.py`](../api/manifest.md)).

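The opt-in behaviour described above can be sketched without the toolkit installed. The snippet below is only an illustration: `VersionedSketch` is a local stand-in that mirrors the one-attribute Protocol, and `collect_versions` is a hypothetical helper, not the toolkit's `_collect_versioned`. It shows how a runtime-checkable Protocol lets objects without `version` be skipped rather than rejected.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class VersionedSketch(Protocol):
    """Local stand-in (assumption) mirroring the one-attribute Protocol."""

    @property
    def version(self) -> str: ...


class WithVersion:
    """Exposes `version`, so it satisfies the Protocol structurally."""

    version = "lr-tfidf-v1.2.0"


class WithoutVersion:
    """No `version` attribute, so the isinstance check fails."""


def collect_versions(objects: dict[str, object]) -> dict[str, str]:
    """Hypothetical helper: keep only the objects that expose `version`."""
    return {
        name: obj.version
        for name, obj in objects.items()
        if isinstance(obj, VersionedSketch)
    }


collected = collect_versions({"a": WithVersion(), "b": WithoutVersion()})
print(collected)  # {'a': 'lr-tfidf-v1.2.0'}
```

The check is structural: nothing has to inherit from the Protocol, which is what makes the attribute optional for implementers.
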
(manifest-thread)=
## Threading through to the manifest

Pass any iterable or mapping of Tier-2 objects via the `versioned`
parameter:

```python
class _Scorer:
    """A toy Scorer with a version attribute."""
    version = "lr-tfidf-v1.2.0"

    def predict_proba(self, X):
        import numpy as np
        return np.full(len(X), 0.5)


class _ScorerNoVersion:
    """No version attribute → silently skipped."""

    def predict_proba(self, X):
        import numpy as np
        return np.full(len(X), 0.5)


m = build_manifest(
    run_id="versioning-demo",
    config={"model": "lr-tfidf"},
    versioned={"my_scorer": _Scorer(), "no_version": _ScorerNoVersion()},
)
print(m.versioned_objects)  # {"my_scorer": "lr-tfidf-v1.2.0"}
```

The mapping form gives you stable keys (recommended); the sequence form
keys by `type(obj).__name__`. Per
[manifest.py § _collect_versioned](../api/manifest.md), both work.

(version-string)=
## Choosing a version string

A good version string is **a fingerprint that changes whenever the
metric's interpretation changes**. Conventions that work:

- **Semver**: `"v1.2.0"`. Bump on any change to scoring logic. Good
  for stable models.
- **Date + hash**: `"2026-05-08-abc1234"`. Bump on every retraining.
  Good for frequently-retrained models.
- **Composite**: `"deberta-v3-base-lora-rank8-2026-q1"`. Embeds the
  recipe in the string. Self-documenting; verbose but unambiguous.
- **Checkpoint SHA**: `f"deberta-lora-{checkpoint_sha[:8]}"`. Appends
  the model artifact's hash. Bit-identical replay possible.

**What NOT to do**: use a static `"v1"` for the lifetime of the
project. The whole point of the field is to *change* when the recipe
changes; a frozen string defeats it.

(versioning-when)=
## When to expose `version`

Expose it on every Tier-2 implementation whose output your reports
depend on. In practice that's:

- **Every `Scorer`**. If you have a regex baseline + an LR + a LoRA
  transformer + an LLM-judge, all four expose distinct version strings.
- **Custom `LeakageCheck` impls** if you maintain them: bumping the
  check's logic should invalidate prior leakage reports.
- **Custom `Splitter` / `DatasetLoader` impls** if their output can
  change from one version of the implementation to the next (e.g., a
  fold-assignment bug fix would bump the version).

The toolkit's *built-in* reference impls intentionally don't expose
`version`; they're versioned indirectly via
`code_versions["eval_toolkit"]` in the manifest. Bumping the toolkit
version invalidates everything; you don't need per-class versions on
top.

(examples)=
## Convention examples (consumer-side)

From `prompt-injection-clean/docs/eval_toolkit_gaps.md` Gap 4, the
canonical mapping for that project's 5 scorers:

```python
class LRBaselineScorer:
    """sklearn LogisticRegression on TF-IDF features."""
    version = "lr-tfidf-v1"


class FrozenProbeScorer:
    """Frozen DeBERTa-v3-base + sklearn LR head."""
    version = "frozen-deberta-v3-base-v1"


class ProtectAIScorer:
    """HuggingFace `protectai/deberta-v3-base-prompt-injection`."""

    def __init__(self, variant: str, revision: str) -> None:
        self.variant = variant
        self.revision = revision
        # Embeds variant + revision SHA prefix → bumps when either changes.
        self.version = f"protectai-{variant}-{revision[:8]}"


class LoRAScorer:
    """In-house DeBERTa + LoRA fine-tune."""

    def __init__(self, checkpoint_sha: str) -> None:
        self.checkpoint_sha = checkpoint_sha
        self.version = f"deberta-lora-v1-{checkpoint_sha[:8]}"
```

Then at eval time:

```python
class _LRBaselineScorer:
    version = "lr-tfidf-v1"

    def predict_proba(self, X):
        import numpy as np
        return np.full(len(X), 0.5)


class _LoRAScorer:
    def __init__(self, sha):
        self.version = f"deberta-lora-v1-{sha[:8]}"

    def predict_proba(self, X):
        import numpy as np
        return np.full(len(X), 0.5)


scorers = {"lr": _LRBaselineScorer(), "lora": _LoRAScorer("abc123def456")}
m = build_manifest(run_id="r", config={}, versioned=scorers)
print(m.versioned_objects)
```

(versioning-pitfalls)=
## Pitfalls / Common mistakes

- **Static version strings.** A scorer with `version = "v1"` that never
  changes is worse than no version at all: it gives a false sense of
  fingerprinting.
- **Version strings that include timestamps that aren't fingerprints.**
  `f"lr-{datetime.now().isoformat()}"` changes every run, making the
  manifest's `versioned_objects` field uselessly noisy. Use a stable
  recipe-fingerprint (semver, hash, frozen-date), not wall-clock.
- **Forgetting to bump.** If you change a scorer's hyperparameter (LoRA
  rank, regex pattern set, LLM model name) without bumping the version,
  the manifest looks the same as before, but the metric is no longer
  comparable. Discipline matters.
- **Not passing `versioned=` to `build_manifest`.** The manifest builder
  doesn't auto-discover Versioned objects; you must pass them
  explicitly from your harness script, typically as `versioned=scorers`
  (the dict you also pass to `evaluate(...)`).
- **Versioning eval-toolkit primitives separately.** Don't expose
  `version` on a vanilla `MaxF1Selector()`; it's already fingerprinted
  by `code_versions["eval_toolkit"]`. The Versioned Protocol is for
  *your* code, not the toolkit's reference impls.

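One way to guard against both the wall-clock pitfall and the forgotten bump is to derive the version string from the recipe itself. The sketch below is an assumption-laden illustration, not part of eval-toolkit: `recipe_fingerprint` is a hypothetical helper, and the recipe keys are made up for the example. It hashes a canonical JSON dump of the hyperparameters, so the string stays fixed across runs and changes when any recipe value changes.

```python
import hashlib
import json


def recipe_fingerprint(prefix: str, recipe: dict) -> str:
    """Hypothetical helper: hash a scorer's recipe into a stable version string."""
    # sort_keys makes the dump independent of dict insertion order.
    canonical = json.dumps(recipe, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{prefix}-{digest[:8]}"


# Illustrative recipe; the keys are assumptions, not a toolkit schema.
base = {"model": "deberta-v3-base", "lora_rank": 8, "prompt_template": "t1"}
same = dict(reversed(list(base.items())))  # same recipe, different key order
bumped = {**base, "lora_rank": 16}         # one hyperparameter changed

v_base = recipe_fingerprint("deberta-lora", base)
v_same = recipe_fingerprint("deberta-lora", same)
v_bumped = recipe_fingerprint("deberta-lora", bumped)

print(v_base == v_same)  # True
print(v_base, v_bumped)  # the two strings differ in their hash suffix
```

A derived fingerprint only covers what is listed in the recipe dict; changes to scoring code that are not reflected there still need a manual bump.
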
## Putting it all together

```python
from eval_toolkit import (
    EvalSlice,
    evaluate,
    build_manifest,
    write_manifest,
    set_global_seeds,
)
import pandas as pd
import numpy as np

set_global_seeds(42)


class _RegexScorer:
    version = "regex-v1.0"

    def predict_proba(self, X):
        return np.full(len(X), 0.5)


class _LRScorer:
    version = "lr-tfidf-v1.2.0"

    def predict_proba(self, X):
        return np.full(len(X), 0.5)


scorers = {"regex": _RegexScorer(), "lr": _LRScorer()}
df = pd.DataFrame({"text": ["a", "b"], "label": [0, 1]})
result = evaluate(scorers, [EvalSlice(name="test", df=df)], run_id="r")

m = build_manifest(
    run_id="r",
    config={"scorers": list(scorers.keys())},
    versioned=scorers,  # ← auto-collects {regex: "regex-v1.0", lr: "lr-tfidf-v1.2.0"}
)

import json
import tempfile

with tempfile.TemporaryDirectory() as d:
    path = write_manifest(m, d)
    print(json.loads(path.read_text())["versioned_objects"])
```

(schema-evolution)=
## Schema evolution policy

The toolkit's JSON Schemas live in
[`src/eval_toolkit/schemas/`](https://github.com/brandon-behring/eval-toolkit/tree/main/src/eval_toolkit/schemas)
and follow a `.vN.json` filename convention (`results.v1.json`,
`results_full.v1.json`, `manifest.v1.json`). This section documents
when the filename bumps and when it stays.

### Filename stays the same (additive changes)

A `.vN.json` filename is **stable** so long as every change is additive
and optional. Concretely, the filename does NOT bump when:

- A new top-level optional field is added.
- A new optional sub-field is added inside an existing object.
- A field's `enum` gains a value (consumers using older enums still
  validate; their value is just one of several allowed).
- Documentation strings (`description`) change.

The forward-compatibility contract is `additionalProperties: true` on
every object node: v0.8 consumers reading a v0.9 result silently
tolerate the new fields.
v0.9 consumers reading a hypothetical v0.10 result tolerate that
version's additions in the same way, as long as v0.10 stays additive.

**v0.9 case study**: `results.v1.json` gained six new top-level
optional fields (`claim_report`, `prediction_artifacts`,
`evidence_axes`, `pairing_metadata`, `aggregate_evidence`,
`threshold_policy`) without a filename bump. v0.8 consumers continue to
read v0.9 outputs cleanly.

### Filename bumps (`.vN.json` → `.v(N+1).json`)

A new schema file ships under a bumped filename when any of these
apply:

- A field is **removed** from the schema.
- A field is **renamed**.
- A field's **semantic meaning** changes (same name, different
  interpretation downstream).
- A field's **type** changes (e.g., `string` → `integer`).
- A field's **requiredness** changes (an optional property is added to
  `required`, or a required one is dropped from it).
- An object node's `additionalProperties` policy tightens from `true`
  to `false`.

Filename bumps are announced in the per-version migration guide under
`docs/migration/` and called out in `CHANGELOG.md`. When `v2` ships,
`v1` stays on disk for one more release cycle to give consumers time to
migrate; consumers can pin to the `.v1.json` filename until they
upgrade.

### The `schema_version` field inside each schema

Each schema also declares a top-level `"version": "N"` and (for result
schemas) a `schema_version` const mirroring the filename:

```json
{
  "$id": "https://eval-toolkit/schemas/results.v1.json",
  "version": "1",
  "properties": {
    "schema_version": {"const": "v1"}
  }
}
```

These fields are redundant on purpose: `$id` and `version` are for
schema-aware tools; `schema_version` is for human consumers who'd
rather grep a result file than its schema. All of them are bumped
together on a filename change.

### Why not semantic versioning (`v1.1.json`)?

The `.vN.json` convention is intentionally coarser than semver. Any
breaking change → new file.
There's no notion of a backward-compatible-but-non-additive bump
(`v1.1` adding a new required field would still break v1 consumers;
calling it `v1.1` would mislead). Additive changes, the only ones that
don't break, get no filename change at all. This keeps consumer code
simple: pin to `.v1.json` and you're guaranteed forward-compat within
the v1 line.

### See also

- [`migration/v0.9.md`](../migration/v0.9.md) §5 for the v0.9 schema
  additions.
- [`CHANGELOG.md`](https://github.com/brandon-behring/eval-toolkit/blob/main/CHANGELOG.md)
  for the per-version record of which schemas grew.

## Further reading

- EleutherAI lm-evaluation-harness,
  [`task_guide`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md):
  the canonical `VERSION`-field pattern this Protocol mirrors.
- Stanford HELM,
  [`schema_classic.yaml`](https://github.com/stanford-crfm/helm/blob/main/src/helm/benchmark/static/schema_classic.yaml):
  versioned-as-artifact pattern for benchmark schemas.
- Mitchell, M. et al. *Model Cards for Model Reporting.* FAccT 2019.
  The canonical "what should be in a model artifact's metadata"
  reference; `version` is one of the required fields.

See also: [reproducibility.md](reproducibility.md) (the manifest the
versioned objects land in),
[extending.md §"Implementing a Scorer"](../extending.md#scorer) (where
to add the `version` attribute on a custom Scorer).