Reference scorer slate and contamination stratification — OpenAI plus Anthropic LLM-judges plus ProtectAI v1 and v2 plus per-axis matched-budget

Published

May 15, 2026

Superseded on one or more axes by ADR-050. The body below retains its original prose per the ADR-073 immutability rule; the corrected position lives in the superseding ADR. See the Decisions index to navigate.

ADR-018: Reference scorer slate and contamination stratification

Status

Accepted (2026-05-15). Partially supersedes ADR-015 reference-slate enumeration (drops Lakera Guard; adds ProtectAI v1). ADR-015’s transformer-architecture claims (single-backbone ModernBERT-base for the trained transformer slate) remain valid; this ADR refines only the reference-rung portion.

Context

ADR-007 originally framed the reference-rung slate as two LLM-judges (one OpenAI plus one Anthropic) plus two optional existing-classifier baselines (Lakera Guard plus ProtectAI). ADR-015 preserved that framing unchanged. Phase 0-03 Q3 surfaced three load-bearing methodology concerns that ADR-007 and ADR-015 did not address explicitly.

First, specific LLM-judge model IDs were deferred — Phase 0-03 must commit to snapshot identifiers (not aliases) so the eval is reproducible. Snapshot IDs are stable for approximately twelve months; aliases drift silently and break reproducibility — the rule from ADR-016 (SHA-pinning for data sources) extends to model versions.

Second, Lakera Guard inclusion carries ToS-audit overhead (commercial APIs often restrict benchmark publication) plus vendor-black-box methodology complexity. For prototype scope, the simpler call is to drop Lakera and rely on ProtectAI as the only off-the-shelf-classifier reference rung — the methodology story remains complete with ProtectAI plus the two LLM-judges, and the Lakera comparison can be named in the afterword as an extension that requires a separate ToS-verification step.

Third, Brandon’s Phase 0-03 Q3 framing surfaced an even sharper methodology concern — no reference rung is fully verified_disjoint relative to our eval slate. ProtectAI’s training corpus disclosure is partial. The LLM-judges (gpt-4o, claude-sonnet-4-6) are trained on essentially all public web text up to their cutoff and may have seen mosscap, HackAPrompt, PromptBench, and similar public eval datasets. Even the ModernBERT-base backbone (used for trained rungs 2 through 4) was pretrained on a web-scale corpus that may include our eval sources. The Phase 0-03 walk added ProtectAI v1 alongside v2 specifically to enable an internal off-the-shelf lift comparison (v1 to v2 lift — what classifier updates buy you, parallel to the trained-rung-lift narrative TF-IDF+LR to frozen-probe to LoRA to full-FT).

Combining these three concerns surfaces a stronger methodology contribution than the original ADR-007 framing — the rung slate now spans every level of the ADR-005 three-state contamination taxonomy, and the methodology spoke can explicitly stratify reference-rung interpretation by contamination disclosure. This turns the contamination concern from a footnote into a methodology axis.

Matched-budget controls (Q2 / ledger row 333) is the related methodology decision — should cross-rung comparisons hold compute budget constant? The natural answer for our heterogeneous slate is per-axis — match data and eval methodology (already locked by ADR-016 and ADR-006), let training compute vary, report it as a Pareto frontier. This handles the heterogeneous cost classes (LLM-judge dollars, GPU-minutes, inference-only) coherently without forcing artificial budget constraints.

Decision

Reference-rung slate (four rungs)

Rung	Model ID	Cost class	Contamination state
R1	gpt-4o-2024-08-06	API USD per call	vendor_black_box
R2	claude-sonnet-4-6 (snapshot ID pinned at Phase 1)	API USD per call	vendor_black_box
R3	protectai/deberta-v3-base-prompt-injection (v1)	local inference (CPU or GPU)	suspected_contamination
R4	protectai/deberta-v3-base-prompt-injection-v2	local inference (CPU or GPU)	suspected_contamination

Lakera Guard is dropped from the reference slate; named in WRITEUP/limitations-and-future-work.md as an afterword extension requiring ToS verification.

LLM-judge call framework (preserved from ADR-007)

Temperature equals zero (deterministic; multi-seed irrelevant)
One call per eval row
Prompt template versioned in src/judges/prompt_template_v1.md and documented; cross-judge prompt is identical (only the API endpoint differs)
Per-row predictions persisted to evals/predictions/__fold.parquet (no seed dimension since deterministic; no epoch dimension since no training)

ProtectAI v1-versus-v2 comparison framework

Both models loaded at native config (DeBERTa-v3-base, 512-token cap, no fine-tuning by us)
Both pinned via HF revision SHA in data/source_manifest.yaml (extended with models section)
Inference uses bf16 on GPU (matches trained-rung precision per ADR-019)
Per-row predictions persisted to evals/predictions/protectai-v1__fold.parquet and evals/predictions/protectai-v2__fold.parquet
Methodology spoke gains a v1-to-v2 lift subsection parallel to the trained-rung lift chain

Contamination stratification

Complete rung-slate contamination taxonomy after ADR-017 plus ADR-018 locks (eight rungs total):

Rung	Contamination state
TF-IDF + LR	verified_disjoint
ModernBERT-base frozen-probe	backbone-partial-disjoint
ModernBERT-base LoRA	backbone-partial-disjoint
ModernBERT-base full-FT	backbone-partial-disjoint
ProtectAI v1	suspected_contamination
ProtectAI v2	suspected_contamination
gpt-4o-2024-08-06	vendor_black_box
claude-sonnet-4-6	vendor_black_box

The methodology spoke includes a dedicated Contamination stratification subsection explaining the four-tier disclosure gradient and framing the reference-rung comparison as “what trained-from-scratch (TF-IDF+LR fully-disjoint anchor) achieves versus what potentially-memorized off-the-shelf models achieve” — any trained-rung lift over LLM-judges is despite the LLM-judge pretrain advantage.

Matched-budget controls (per-axis) — ledger row 333

Matched — data (same train and eval splits per ADR-016); eval methodology (same metrics, same statistical machinery per ADR-006)
Not matched — training compute (each rung uses natural recipe; ADR-017 plus ADR-019 specify each rung’s recipe)
Reported — training compute per rung in the writeup (wall-clock on the GPU class detected at runtime; per ADR-020 runpod-deploy primitives capture this in the per-pod manifest); the methodology spoke plots AUPRC versus compute as a Pareto frontier — the rung-ladder IS the Pareto frontier

Per-axis matching is the only framing that coherently handles the heterogeneous cost classes (LLM-judge dollars-per-call versus trained rungs GPU-minutes versus ProtectAI inference-only). Matched-training-compute would violate SPEC §2 hyperparameter-immutability (would require val-set tuning to find the budget cutoff) and would not fit the LLM-judge or ProtectAI cost classes coherently.

Consequences

Positive

Snapshot model IDs are pinned for reproducibility (gpt-4o-2024-08-06 stable; claude-sonnet-4-6 with date-suffixed snapshot resolved at Phase 1)
Contamination stratification turns a weakness into a methodology contribution — the rung slate now spans every level of the three-state taxonomy
ProtectAI v1-to-v2 lift comparison parallels the trained-rung lift chain — surfaces what off-the-shelf classifier updates buy you
Per-axis matched-budget aligns with SPEC §2 hyperparameter-immutability plus accommodates the heterogeneous rung cost classes
Dropping Lakera simplifies scope (no ToS audit, no vendor-black-box complexity beyond the LLM-judges)
API budget remains within A-002 envelope — approximately ten to twelve dollars total for both LLM-judge rungs

Negative

Dropping Lakera removes one commercial-API data point; mitigated by the afterword extension flag
The methodology spoke gains a dedicated contamination-stratification section plus a per-axis matched-budget section; the writeup is denser but more honest
A-006 is a new severity-medium assumption — surfaces in the WRITEUP caveats block per the reporting-completeness invariant
The four-reference-rung slate plus the four-trained-rung slate gives eight rungs total; the Cohen’s kappa pairwise matrix (preserved from ADR-007) becomes a 28-pair heatmap which is dense but readable

Phase 1 deliverables

data/source_manifest.yaml — extend with models section (ProtectAI v1 plus v2 HF revision SHAs) plus judges section (gpt-4o snapshot ID plus claude-sonnet-4-6 snapshot ID with date suffix resolved at Phase 1)
src/judges/prompt_template_v1.md — versioned LLM-judge prompt template
src/judges/openai_caller.py and src/judges/anthropic_caller.py — temp=0 API wrappers with manifest-pinned snapshot IDs
src/rungs/protectai_v1.py and src/rungs/protectai_v2.py — HF inference wrappers at native config with bf16 on GPU
evals/predictions/{gpt-4o,claude-sonnet-4-6,protectai-v1,protectai-v2}__fold.parquet — per-rung per-fold predictions
EVIDENCE.md — contamination-state entry per reference rung per the 3-state taxonomy
WRITEUP methodology spoke — Contamination stratification subsection plus Matched-budget framing subsection

Alternatives considered

Include Lakera Guard with vendor-black-box framing plus ToS audit — rejected for prototype scope; the ToS audit overhead and the partial-disclosure complexity exceed the marginal methodology value of one additional commercial-API data point. Named in afterword extension.
Drop LLM-judges entirely (use ProtectAI only) — rejected because LLM-judges provide an upper-bound signal under maximum-memorization conditions (“ceiling achievable by a frontier model that may have memorized everything”); if our trained rungs approach or beat that ceiling despite the contamination disadvantage, the result is stronger; if we lose, the contamination framing upper-bounds the gap. The Cohen’s kappa pairwise structure also benefits from the LLM-judge rungs.
Use GPT-4o-mini or Claude Haiku for cost — rejected because budget-tier judges risk strawman framing if they underperform; the methodology question is “is the LLM-judge a strong baseline?” which requires capable mid-tier models. The mid-tier (gpt-4o + claude-sonnet-4-6) pair adds approximately ten dollars to the budget — negligible against A-002 envelope.
Use frontier-tier judges (gpt-4.1 or claude-opus-4-7) — rejected for prototype; mid-tier is the methodology-balanced choice. Frontier-tier comparison named in afterword extension as ablation study at smaller eval slice.
Use ProtectAI v2 only without v1 — rejected because the v1-to-v2 lift comparison parallels the trained-rung-lift narrative and surfaces what off-the-shelf updates buy you; adding v1 is one extra inference run per fold at near-zero compute cost.
Yes-match-training-compute (give every trained rung the same budget) — rejected because it violates SPEC §2 hyperparameter-immutability (matched-budget would force val-set tuning to find the budget cutoff) plus does not fit LLM-judge or ProtectAI cost classes coherently.
No matched-budget (natural recipes, no compute reporting) — rejected because reviewer may push back on “of course full-FT wins, it had more compute”; per-axis matching exposes the compute axis explicitly as a Pareto frontier rather than hiding it.

References

See frontmatter references list. Primary anchors — Zheng et al. 2023 LLM-as-a-Judge methodology; Zhou et al. 2024 LLM evaluation contamination survey; Sainz et al. 2023 NLP evaluation in trouble (contamination disclosure central to eval methodology); MMLU-Pro 2024 contamination-stratified evaluation; Hooker 2021 Hardware Lottery (matched-compute critique); OpenAI deprecation policy for snapshot IDs; ProtectAI v1 plus v2 model cards; HuggingFace PEFT comparative study; EleutherAI GPT-NeoX paper §6 (multi-rung evaluation framing); ADR-015 transformer-slate single-backbone claim; ADR-007 historical record; ADR-005 contamination taxonomy; ADR-016 data design plus source manifest.