Data design bundle — source slate, splits, HF pinning, dedup, ordering, ceilings, ref-scorer audit

Published

May 15, 2026

ADR-016: Data design bundle — source slate, splits, HF pinning, dedup, ordering, ceilings, ref-scorer audit

Status

Accepted (2026-05-15)

Context

ADR-008 (data scope — public-only sources, hybrid splits, NotInject inclusion) locked the brief-level data-design direction. Phase 0-02 walks the methodology-specifics that ADR-008 deferred — which datasets, what dedup encoder, what split count, what subsample ceilings, what reference-scorer audit pattern. The walk surfaced a load-bearing prototype-scope reframing — under 2.5 working days plus a focused-PDF submission target, methodology depth should exceed methodology breadth, and several originally-considered methodology-extensions (stratified k-fold within LODO, broader direct-OOD coverage, full 4-gate dedup calibration, cross-source same-style ablation) are deferred to a named afterword that prioritizes them as next-iteration work.

A second load-bearing finding from the walk is the dossier gap on training benigns — ADR-008 line 31 noted that “a training-benigns slate is not in the dossier” and named plausible candidates (LMSYS-Chat-1M, ShareGPT, UltraChat) without verification. Phase 0-02 surfaces this gap explicitly and resolves it by locking a two-source benign mix (LMSYS-Chat-1M plus UltraChat) with mandatory Phase 1 contamination scan; the gap is named in WRITEUP/limitations-and-future-work.md as a methodology-honesty artifact rather than papered over.

A third finding from the walk is researcher-auditability as a methodology constraint. It surfaced during the Phase 0-01 multilingual-scope discussion and carried into Phase 0-02: the researcher cannot independently audit non-English samples. Combined with the all-English dossier-vetted eval slices, this justifies the English-only scope from ADR-014 Q2 and shapes the English-filter requirement applied to the LMSYS-Chat-1M benign pool (which contains some non-English content).

The seven sub-decisions in this ADR are presented as a single bundle because they cohere under the prototype framing — clear claim shape with explicit future-work axes named. Each sub-decision references its individual SPEC_GREENFIELD ledger row.

Decision

Q1 — Source slate (row 323) — Path alpha

Train positives (4 sources, LODO-rotational):

Source	HF revision pin location	License	Approx N post-dedup	LODO fold
`deepset/prompt-injections`	Phase 1 lock	Apache-2.0	~500-650 (use all)	1
`Lakera/gandalf_ignore_instructions`	Phase 1 lock	MIT	~800-1000 (use all)	2
`Lakera/mosscap_prompt_injection`	Phase 1 lock	MIT	~3000 (cap; see Q6)	3
`hackaprompt/hackaprompt-dataset`	Phase 1 lock	per dataset card	~3000 (cap; see Q6)	4

Train benigns (2 sources, stratified across folds):

Source	HF revision pin location	License	Approx N post-dedup	Filter
`lmsys/lmsys-chat-1m`	Phase 1 lock	CC-BY-4.0	~10000 (cap; see Q6)	English-only language filter applied before subsample
`HuggingFaceH4/ultrachat_200k`	Phase 1 lock	Apache-2.0	~10000 (cap; see Q6)	None

OOD eval slate (5 slices, never-trained-on):

Source	Pin type and location	License	Approx N	Role
`leolee99/NotInject`	HF SHA at Phase 1	MIT	339	OOD hard-neg (over-defense)
`paul-rottger/xstest`	git commit SHA at Phase 1	per repo	450	OOD hard-neg (over-refusal)
`JailbreakBench/JBB-Behaviors`	HF SHA at Phase 1	MIT	200	OOD mixed (100 harmful plus 100 benign)
`microsoft/BIPIA`	git commit SHA at Phase 1	per repo	per-task	OOD indirect (zero-shot per ADR-014 Q1)
`uiuc-kang-lab/InjecAgent`	git commit SHA at Phase 1	per repo	1054	OOD agentic (stretch probe per ADR-010 Bound 2)

Afterword extensions (named in WRITEUP/limitations-and-future-work.md):
- HarmBench standard subset (Mazeika et al. 2024, ICML; arXiv:2402.04249): broader direct-OOD coverage; engagement-level upgrade from ADR-012 “cite + acknowledge” to “replicate”
- Tensor Trust subsample (Toyer et al. 2023, ICLR 2024; arXiv:2311.01011): human-adversarial-direct diversity at ~1-2K subsample
- LLMail-Inject (Abdelnabi et al. 2025; arXiv:2506.09956): adaptive email-context indirect

Q2 — Splits structure (row 327) — LODO k=4 plus 3 seeds; no internal k-fold

Mechanism: source-disjoint Leave-One-Dataset-Out at the outer level over the 4 positive sources; for each LODO fold, the in-distribution training pool is 3 positive sources plus the deduped benign pool; the held-out positive source serves as the cross-source test. Within each LODO fold there is a single 80/20 train/val random split (not nested k-fold) — val is used for threshold selection plus calibration fitting plus early-stopping per ADR-011 Guarantee 6 (no adaptive threshold selection on test).
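The fold construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the function name `build_lodo_folds`, the `(text, label)` row shape, the short source keys, and the toy rows are all hypothetical. How benign rows reach the test side of each fold is not modeled here.

```python
import random


def build_lodo_folds(positives_by_source, benigns, val_fraction=0.2, seed=42):
    """Build one fold per positive source (Leave-One-Dataset-Out).

    positives_by_source: dict mapping source name -> list of positive texts.
    benigns: list of deduped benign texts shared by every fold.
    """
    folds = []
    for held_out in sorted(positives_by_source):
        # In-distribution pool: the other positive sources plus the benign pool.
        pool = [
            (text, 1)
            for source, rows in sorted(positives_by_source.items())
            if source != held_out
            for text in rows
        ]
        pool += [(text, 0) for text in benigns]

        # Single random 80/20 train/val split (no nested k-fold).
        rng = random.Random(seed)
        rng.shuffle(pool)
        n_val = int(len(pool) * val_fraction)
        folds.append({
            "held_out": held_out,
            "val": pool[:n_val],
            "train": pool[n_val:],
            # Cross-source test positives: the held-out source only.
            "test": [(text, 1) for text in positives_by_source[held_out]],
        })
    return folds


# Toy data (hypothetical): 2 positives per source, 14 benigns.
toy_positives = {
    "deepset": ["p1", "p2"],
    "gandalf": ["p3", "p4"],
    "mosscap": ["p5", "p6"],
    "hackaprompt": ["p7", "p8"],
}
toy_benigns = ["b%d" % i for i in range(14)]
folds = build_lodo_folds(toy_positives, toy_benigns)
```

Threshold selection, calibration fitting, and early stopping would read only the `val` rows of each fold; the `test` rows stay untouched until final scoring.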

Variance estimation: 3 random initialization seeds per LODO fold per trained rung — this is the ADR-006 floor; matches the Path B compute envelope.

Per-rung total: 4 LODO folds times 3 seeds equals 12 observations per rung. Submission total: 3 ModernBERT-base trained rungs (frozen-probe, LoRA, full-FT per ADR-015) times 12 equals 36 trained runs.
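The run arithmetic can be checked by enumerating the grid directly. In this sketch the seed values `0, 1, 2` are an assumption (the ADR fixes only the number of seeds), and the rung labels are shorthand for the three trained rungs named above.

```python
from itertools import product

RUNGS = ["frozen-probe", "LoRA", "full-FT"]  # trained rungs per ADR-015
LODO_FOLDS = [1, 2, 3, 4]                    # one fold per positive source
SEEDS = [0, 1, 2]                            # assumed values; only the count is fixed

# Every trained run is one (rung, fold, seed) cell.
runs = list(product(RUNGS, LODO_FOLDS, SEEDS))
observations_per_rung = len(LODO_FOLDS) * len(SEEDS)

print(observations_per_rung, len(runs))
```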

Bootstrap CI mechanics: per-rung aggregate AUROC is the mean over 12 observations; per-rung CI is bootstrap (10K iterations, BCa marginal) across the 12 (LODO-fold, seed) observations; rung-vs-rung paired-bootstrap differences use the same (LODO-fold, seed) pairing.
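A minimal sketch of the paired-bootstrap difference follows. It uses a plain percentile interval rather than the BCa interval named above, purely to stay within the standard library; the function name and the AUROC values are illustrative placeholders, not results.

```python
import random
import statistics


def paired_bootstrap_ci(scores_a, scores_b, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(scores_a - scores_b).

    scores_a[i] and scores_b[i] must come from the same (LODO-fold, seed) cell,
    so resampling the differences preserves the pairing.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired inputs must have equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)
    n = len(diffs)
    boot_means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_boot)
    )
    lo = boot_means[int((alpha / 2) * n_boot)]
    hi = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return statistics.fmean(diffs), lo, hi


# Placeholder AUROCs for two rungs over the same 12 (fold, seed) cells.
rung_a = [0.91, 0.93, 0.92, 0.88, 0.90, 0.89, 0.94, 0.95, 0.93, 0.86, 0.87, 0.88]
rung_b = [0.89, 0.90, 0.91, 0.85, 0.88, 0.86, 0.93, 0.92, 0.91, 0.84, 0.85, 0.85]
mean_diff, lo, hi = paired_bootstrap_ci(rung_a, rung_b)
```

If SciPy is available, `scipy.stats.bootstrap` with `method="BCa"` and `paired=True` is one way to obtain the bias-corrected interval instead of the percentile one used here.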

MDE on Delta-AUROC: approximately 0.03 (typical for n=12 paired observations) — reportable; honestly characterizes prototype precision.

Afterword extension: stratified k-fold within LODO (Fomin 2025 plus Nadeau-Bengio 2003 methodology) for proper within-source-variance decomposition — approximately 5x compute multiplier.

Q3 — HF dataset revision pinning (row 324) — SHA-pin at Phase 1; unified manifest; manifest-documented bumps

Mechanism: At Phase 1 entry, each source’s revision is pinned in data/source_manifest.yaml — HF datasets pin to HF revision SHA; GitHub-cloneable sources pin to git commit SHA. The manifest carries per-source row counts plus license plus role plus pin reference plus bump history.

Bump policy: bumps are documented as YAML diff entries in data/source_manifest.yaml under bump_history. ADR per bump is required only for schema changes (column adds, removes, type changes); label corrections plus content additions plus row removals are manifest-only.

CI invariant: test_source_manifest_schema_valid (skip-marked stub at Phase 0 close; implemented Phase 1) asserts the manifest parses, contains all 11 sources, each has SHA plus license plus role.
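One possible shape for the invariant is sketched below. It assumes the YAML manifest has already been parsed into a dictionary with a top-level `sources` mapping; that layout, the field names `sha`, `license`, and `role`, and the helper `validate_manifest` are hypothetical, and the SHA values are placeholders rather than real revisions.

```python
REQUIRED_FIELDS = ("sha", "license", "role")
EXPECTED_SOURCE_COUNT = 11  # 4 train positives + 2 train benigns + 5 OOD slices

SOURCE_NAMES = [
    "deepset/prompt-injections",
    "Lakera/gandalf_ignore_instructions",
    "Lakera/mosscap_prompt_injection",
    "hackaprompt/hackaprompt-dataset",
    "lmsys/lmsys-chat-1m",
    "HuggingFaceH4/ultrachat_200k",
    "leolee99/NotInject",
    "paul-rottger/xstest",
    "JailbreakBench/JBB-Behaviors",
    "microsoft/BIPIA",
    "uiuc-kang-lab/InjecAgent",
]


def validate_manifest(manifest):
    """Return a list of problems; an empty list means the manifest passes."""
    problems = []
    sources = manifest.get("sources", {})
    if len(sources) != EXPECTED_SOURCE_COUNT:
        problems.append(
            f"expected {EXPECTED_SOURCE_COUNT} sources, found {len(sources)}"
        )
    for name, entry in sources.items():
        for field in REQUIRED_FIELDS:
            if not entry.get(field):
                problems.append(f"{name}: missing {field}")
    return problems


# Placeholder manifest: every value below is a stand-in, not a real pin.
example_manifest = {
    "sources": {
        name: {"sha": "0" * 40, "license": "placeholder", "role": "placeholder"}
        for name in SOURCE_NAMES
    }
}
problems = validate_manifest(example_manifest)
```

A pytest wrapper named `test_source_manifest_schema_valid` could then load the YAML file and assert that `validate_manifest` returns an empty list.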

Q4 — Dedup encoder plus threshold (row 325) — MiniLM-L6-v2 cosine at 0.80; simplified calibration

Encoder: sentence-transformers/all-MiniLM-L6-v2 (~22M parameters, 384-dim embeddings). Within-(source, label) cosine similarity computed pairwise; pairs above threshold are deduplicated by keeping the first occurrence per the SPEC_GREENFIELD deterministic-ordering lock.

Threshold: 0.80 — eval-toolkit kit hint default; industry-standard starting point.
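The first-occurrence rule can be illustrated with a greedy pass over precomputed embeddings. This sketch assumes each row is compared only against rows already kept, which is one reasonable reading of "keeping the first occurrence"; the toy 2-D vectors stand in for the 384-dim all-MiniLM-L6-v2 embeddings, and the function names are hypothetical. The pass would be applied separately to each (source, label) group.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def dedup_first_occurrence(embeddings, threshold=0.80):
    """Return indices to keep, scanning rows in their given deterministic order.

    A row is dropped when its similarity to any already-kept row exceeds the
    threshold, so the first member of each near-duplicate cluster survives.
    """
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) <= threshold for j in kept):
            kept.append(i)
    return kept


# Toy embeddings: rows 1 and 3 are near-duplicates of rows 0 and 2.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.7, 0.7]]
kept = dedup_first_occurrence(toy_embeddings)
print(kept)  # [0, 2, 4]
```

The pairwise loop is quadratic in group size; at the 3K and 10K caps a vectorized similarity matrix would be the practical choice, but the retention logic stays the same.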

Calibration evidence (evals/dedup_calibration.json):
- Cosine distribution histogram per (source, label) pair: anisotropy sanity check (Ethayarajh 2019)
- FPR plus FNR at threshold 0.80 against a hand-curated 50-pair labeled holdout (positive pairs known near-duplicates; negative pairs known non-duplicates)
- Dedup counts at thresholds {0.75, 0.80, 0.85} for sensitivity
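The FPR and FNR figures for the labeled holdout could be computed along these lines. The function name, the `(similarity, is_duplicate)` pair format, and the six toy pairs are assumptions for illustration; the real holdout has 50 hand-labeled pairs.

```python
def dedup_error_rates(pairs, threshold):
    """Error rates of the dedup rule on a list of labeled pairs.

    pairs: list of (cosine_similarity, is_duplicate) tuples.
    FPR = non-duplicate pairs flagged as duplicates / all non-duplicate pairs.
    FNR = duplicate pairs not flagged / all duplicate pairs.
    """
    fp = sum(1 for sim, dup in pairs if not dup and sim > threshold)
    fn = sum(1 for sim, dup in pairs if dup and sim <= threshold)
    n_neg = sum(1 for _, dup in pairs if not dup)
    n_pos = sum(1 for _, dup in pairs if dup)
    return {
        "fpr": fp / n_neg if n_neg else 0.0,
        "fnr": fn / n_pos if n_pos else 0.0,
    }


# Toy holdout (hypothetical similarities and labels).
toy_pairs = [
    (0.95, True), (0.83, True), (0.78, True),
    (0.82, False), (0.60, False), (0.40, False),
]
sensitivity = {t: dedup_error_rates(toy_pairs, t) for t in (0.75, 0.80, 0.85)}
```

Sweeping the three thresholds shows the expected trade-off on the toy data: raising the threshold lowers FPR and raises FNR.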

Afterword extensions: MPNet-base-v2 encoder upgrade (5x compute); full 4-gate selection rule (FPR plus FNR plus stability plus calibration-curve persistence) with larger labeled holdout (~200-500 pairs).

Q5 — Cross-source benign dedup ordering (row 326) — within-source-first; LMSYS-priority tiebreak

Pipeline order:

1. Load each benign source; apply English-only language filter to LMSYS-Chat-1M; subsample each to ~10K with seed=42
2. Within-source dedup pass: MiniLM cosine at 0.80; per-source first-occurrence retention
3. Cross-source dedup pass: same MiniLM threshold; LMSYS-Chat-1M priority on cross-source near-duplicates (LMSYS is real-user data and UltraChat is synthetic, so the real-data signal is preserved when both score as near-duplicates)
4. Split into LODO folds; stratify benign assignment so each LODO fold receives an approximately equal share of post-dedup LMSYS plus post-dedup UltraChat
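The cross-source pass with LMSYS priority can be sketched as a one-directional filter: LMSYS rows are always kept, and an UltraChat row is dropped when it is a near-duplicate of any LMSYS row. The function names and toy 2-D vectors are hypothetical, and both pools are assumed to have been through the within-source pass already.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cross_source_dedup(priority_embs, other_embs, threshold=0.80):
    """Return indices of other_embs rows that survive the cross-source pass.

    Rows in the priority pool are never removed; a row in the other pool is
    dropped when its similarity to any priority row exceeds the threshold.
    """
    return [
        i for i, emb in enumerate(other_embs)
        if all(cosine(emb, p) <= threshold for p in priority_embs)
    ]


# Toy pools: the first UltraChat row nearly duplicates the first LMSYS row.
lmsys_embs = [[1.0, 0.0], [0.0, 1.0]]
ultrachat_embs = [[0.9, 0.1], [0.7, 0.7], [-1.0, 0.0]]
ultrachat_survivors = cross_source_dedup(lmsys_embs, ultrachat_embs)
print(ultrachat_survivors)  # [1, 2]
```

Because only the non-priority pool loses rows, the per-source dedup rates from the within-source pass remain auditable for LMSYS.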

Rationale: ADR-008 already locked the before-split direction; this Q5 sub-decision picks within-source-first to preserve per-source first-occurrence determinism and to keep cross-source dedup as a smaller second-pass operation over the within-source-deduplicated pools.

Q6 — Benign subsample ceilings per source (row 329) — 3K positives per source; 10K benigns per source; random seed=42

Per-source caps:

Source	Cap	Selection
`deepset/prompt-injections`	None (use all post-dedup, approx 500-650)	All
`Lakera/gandalf_ignore_instructions`	None (use all post-dedup, approx 800-1000)	All
`Lakera/mosscap_prompt_injection`	3000	Random with seed=42
`hackaprompt/hackaprompt-dataset`	3000	Random with seed=42
`lmsys/lmsys-chat-1m`	10000	Random with seed=42 after English-only filter
`HuggingFaceH4/ultrachat_200k`	10000	Random with seed=42
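The cap-and-seed selection in the table can be expressed as a small helper. This is a sketch: `capped_subsample` is a hypothetical name, and the integer ranges stand in for real rows.

```python
import random


def capped_subsample(rows, cap=None, seed=42):
    """Return all rows when no cap applies, else a seeded random sample of size cap.

    rows must be a sequence (for example a list) in a deterministic order,
    otherwise the same seed would not reproduce the same selection.
    """
    if cap is None or len(rows) <= cap:
        return list(rows)
    return random.Random(seed).sample(rows, cap)


# Stand-in pools: a large source that hits the 3000 cap and a small one that does not.
large_pool = list(range(10_000))
small_pool = list(range(600))

capped = capped_subsample(large_pool, cap=3000)
uncapped = capped_subsample(small_pool, cap=None)
```

Using a fresh `random.Random(seed)` per call, rather than the module-level generator, keeps each source's selection independent of the order in which sources are processed.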

Class balance under caps: per-LODO-fold training pool is approximately 6-8K positives (3 sources of capped approximately 2-3K each) plus approximately 16K benigns (80 percent of post-dedup approximately 20K benign pool) — ratio approximately 1:2 to 1:2.7.

Afterword extensions: larger caps (10K mosscap plus 10K HackAPrompt), quality-filtered HackAPrompt (success-metadata filter; sensitivity vs random), attack-type-stratified subsample (preserves HackAPrompt’s 29-technique diversity per Schulhoff 2023), length-stratified subsample (balanced length-bucket coverage per Phase 1 length audit).

Q7 — Reference-scorer audit for partial disclosure (row 328) — fold-pattern plus scope cross-check

Per-reference-rung audit deliverables (EVIDENCE.md sections 1 plus 2):

Lakera Guard (closed-source, API):
- Stated scope claims from public documentation
- Per-LODO-fold AUROC pattern: observe whether Lakera scores systematically better on fold X (suggesting the fold X source is in the Lakera training pool)
- Scope cross-check verdict per stated claim: [VERIFIED|UNVERIFIED|REFUTED]
- Contamination conclusion documented

ProtectAI deberta-v3-base-prompt-injection (open weights, partial disclosure):
- Stated training-data category disclosure from model card (“open-source prompt-injection datasets plus synthetic”)
- Per-LODO-fold AUROC pattern: particularly observe whether ProtectAI scores systematically higher on the deepset plus Lakera-gandalf folds (likely in its training distribution)
- Scope cross-check verdict per stated claim
- Contamination conclusion documented

Output: WRITEUP/reference-scorer-audit.md spoke renders per-fold pattern figures plus claim audit verdict; cross-detector fold-pattern correlation matrix flags pairs with r > 0.7 as suggestive of shared training distribution.
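The cross-detector flagging rule can be sketched with a plain Pearson correlation over per-fold AUROC vectors. The detector names and AUROC values below are hypothetical placeholders, and with only four folds per detector any correlation is weak evidence, which matches the "suggestive" framing above.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def flag_correlated_detectors(fold_auroc, threshold=0.7):
    """fold_auroc: dict detector -> per-LODO-fold AUROCs in the same fold order."""
    flagged = []
    for a, b in combinations(sorted(fold_auroc), 2):
        r = pearson(fold_auroc[a], fold_auroc[b])
        if r > threshold:
            flagged.append((a, b, round(r, 3)))
    return flagged


# Placeholder per-fold AUROCs (folds 1-4), not measured results.
toy_fold_auroc = {
    "detector_a": [0.95, 0.93, 0.80, 0.78],
    "detector_b": [0.97, 0.94, 0.82, 0.75],
    "detector_c": [0.80, 0.90, 0.85, 0.88],
}
flagged = flag_correlated_detectors(toy_fold_auroc)
```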

Afterword extension: cross-source same-style ablation — train a per-attack-style oracle on each reference scorer’s likely training distribution and compare; requires per-attack-style sample size and substantial additional training; future-work axis.

Phase 1 audit revisit triggers (Assumption A-005 new)

The locked source slate plus splits plus ceilings depend on the Phase 1 empirical data audit confirming the dossier estimates. Specific Phase 1 invalidation triggers — any one of which fires the superseding-ADR requirement:

Benign contamination scan: more than 2 percent of either LMSYS or UltraChat flagged as injection-template-match (MiniLM cosine ≥ 0.85 to a known injection template). Superseding ADR adjusts source mix or filter threshold or substitutes a different benign source.
Class-balance: post-dedup per-LODO-fold training-pool class-balance falls outside the 1:3 to 1:10 positive:negative range. Superseding ADR adjusts subsample ceilings.
Per-source labeling-quality: systematic mislabeling detected in any source via spot-check (manual audit of ~50 random samples per source). Superseding ADR drops or revises the affected source.
Length-distribution divergence: actual per-source length distribution diverges materially (5x or more on a percentile) from dossier estimates. Superseding ADR re-walks the ADR-014 Q4 truncation policy and possibly the Q6 length-stratified subsample.

evals/data_audit.{contamination,balance,length,labeling}.json are the Phase 1 deliverables that operationalize these triggers.
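A check over the four audit outputs could look like the sketch below. All field names in the `audit` dictionary are hypothetical, as is the function name; the thresholds are the ones stated in the triggers above, and the toy values are placeholders.

```python
def fired_triggers(audit):
    """Return the names of the Phase 1 invalidation triggers that fire.

    audit keys (hypothetical field names):
      contamination_rate: dict benign source -> fraction flagged as template match
      fold_neg_per_pos:   negatives-per-positive ratio for each LODO fold
      mislabeling_found:  dict source -> bool from the manual spot-check
      length_ratio:       dict source -> worst-case ratio between actual and
                          estimated length at any percentile (always >= 1)
    """
    fired = []
    if any(rate > 0.02 for rate in audit["contamination_rate"].values()):
        fired.append("benign_contamination")
    # In range means between 1:3 and 1:10 positive:negative.
    if any(not (3 <= ratio <= 10) for ratio in audit["fold_neg_per_pos"]):
        fired.append("class_balance")
    if any(audit["mislabeling_found"].values()):
        fired.append("labeling_quality")
    if any(ratio >= 5 for ratio in audit["length_ratio"].values()):
        fired.append("length_divergence")
    return fired


# Placeholder audit values: only the contamination trigger should fire.
toy_audit = {
    "contamination_rate": {"lmsys": 0.031, "ultrachat": 0.004},
    "fold_neg_per_pos": [3.5, 4.0, 4.2, 3.8],
    "mislabeling_found": {"deepset": False, "gandalf": False},
    "length_ratio": {"deepset": 1.4, "gandalf": 2.0},
}
triggers = fired_triggers(toy_audit)
print(triggers)  # ['benign_contamination']
```

Any non-empty result would activate the superseding-ADR path for the corresponding trigger.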

Afterword commitment

WRITEUP/limitations-and-future-work.md spoke contains a prioritized “Next Iteration Priorities” subsection enumerating the seven afterword extensions surfaced by this ADR plus prior Phase 0-01 ADR-014 plus ADR-015 extensions:

Within-fold variance decomposition (stratified k-fold within LODO per Fomin 2025 plus Nadeau-Bengio 2003)
Broader direct-OOD coverage (HarmBench plus Tensor Trust plus LLMail-Inject)
Adaptive red-team probing (GCG plus PAIR plus TAP per ADR-010 Bound 6)
Matched-context cross-backbone control or larger backbone variants (ModernBERT-large; matched-context DeBERTa-v3 control)
Multilingual extension (per ADR-014 Q2 and ADR-010 Bound 1 — requires native-speaker annotation skill set)
Calibration battery extensions (Phase 0-04 baseline plus temperature plus isotonic ablations)
Production-deployment integration (runtime classifier-in-loop)

This list is updated forward by future ADRs as new extensions are surfaced or completed extensions are crossed off.

Consequences

Positive:

Prototype-scoped methodology — clear claim shape with explicit future-work axes; aligns with ADR-005 Principle 1 (methodology over metrics — depth over breadth at this scope) and Principle 3 (structured limitations with extension conditions)
Bit-reproducibility floor via SHA-pinned source manifest; any reviewer can re-fetch and re-run
LODO methodology measures cross-source generalization directly; honest about distribution-shift
Class balance approximately 1:2.7 is deployment-relevant — not so balanced it ignores production reality, not so imbalanced positive signal starves
Phase 1 audit triggers (A-005) make the source-slate lock empirically falsifiable; revisit path is well-defined
Researcher-auditability constraint surfaced honestly — English-only filter on LMSYS plus all-English eval slate plus dossier-vetted-English sources align coherently
Compute fits the A-002 envelope: $60-115 total estimated, which is inside the revised $25-125 envelope, although the upper estimate sits close to the ceiling (slight stretch to upper bound acknowledged)
All seven sub-decisions documented in single ADR-016 — reviewer reads one artifact for Phase 0-02 data design; supporting work artifacts (manifest, calibration JSON, audit JSONs, evidence sections) follow from there
Afterword spoke produces a citable “what’s next” artifact; reviewer maps deployment context to extension priorities

Negative / cost:

Loses approximately 99 percent of mosscap plus HackAPrompt content via 3K cap; random subsample may miss rare attack patterns in the discarded portion — explicit limitation; afterword names attack-type-stratified subsample as the extension
Benign contamination scan is mandatory at Phase 1 entry — if contamination rate exceeds 2 percent in LMSYS or UltraChat, the source-slate lock is invalidated and the supersession path activates (real risk; A-005 medium severity)
Phase 1 dedup calibration uses a hand-curated 50-pair holdout — small sample; simplified rule rather than full 4-gate; afterword names the upgrade path
LODO k=4 produces 12 observations per rung; bootstrap CIs have MDE approximately 0.03 Delta-AUROC — honestly characterizes prototype precision; afterword names stratified-k-fold-within-LODO for tighter MDE
HarmBench (ICML 2024 canonical direct-OOD benchmark) is in afterword rather than slate; reviewer may ask “why not HarmBench?” — answer is prototype scope plus dossier-vetted-OOD-already-covers-direct-attack-class via JBB-Behaviors harmful subset
Reference-scorer audit option C deferred — same-style ablation is the gold-standard verdict; option B (fold-pattern plus scope cross-check) is suggestive but not dispositive; explicitly framed as such in EVIDENCE.md and the spoke

Neutral:

ADR-008 brief-level locks preserved unchanged; ADR-016 fills in the methodology specifics ADR-008 deferred
ADR-012 engagement set preserved — HarmBench remains “cite + acknowledge”; afterword commits to upgrading to “replicate” in next iteration
Phase 0-03 still walks §2 Model rows for backbone-size-specifics plus LoRA hyperparameters plus matched-budget controls plus reference-scorer-model-IDs plus compute budget
The 2 hard-locked invariants from SPEC_GREENFIELD §1 (label-aware dedup discipline; leakage detection invariant tests) are preserved and surface as Phase 1 invariant stubs

Alternatives Considered

Path gamma — add HarmBench plus Tensor Trust to OOD slate now: stronger direct-OOD coverage; canonical ICML 2024 benchmark plus human-adversarial diversity. Rejected after prototype-scope reframing because it adds Phase 1 license verification plus dedup plus leakage-scan work for two new sources (~1 full day Phase 1 audit overhead); marginal methodology gain over current 5-source OOD slate for the prototype’s claim shape; afterword names both as next-iteration priorities.
Stratified k-fold within LODO (Option A from Q2 walk): methodologically gold-standard per Fomin 2025 plus Nadeau-Bengio 2003 — proper variance decomposition. Rejected after prototype-scope reframing because 5x compute multiplier (180 runs vs 36 runs) stretches A-002 budget; bootstrap CI shape from 12 observations is adequate for prototype claim precision (MDE approximately 0.03); afterword names this as variance-decomposition extension.
LMSYS-only benigns (Q1 option A): simpler; single-source. Rejected because single-source benign-feature overfitting risk; Phase 1 LODO would conflate detector capacity with benign-source-feature overfitting.
UltraChat-only benigns (Q1 option B): simplest license; no contamination risk. Rejected because pure synthetic loses deployment-distribution realism; reviewer asks “did your detector see real user prompts?” with no honest answer.
3-source benign mix (Q1 option D — LMSYS plus UltraChat plus OpenAssistant): maximum source diversity. Rejected because OpenAssistant adds dedup plus license complexity without clear marginal benefit over LMSYS plus UltraChat mix; afterword can add if Phase 1 reveals class-balance inadequate.
Defer benigns selection to Phase 1 (Q1 option E): lower commitment now. Rejected because it violates the Phase 0 close criterion (every [OPEN] resolved or explicitly deferred-to-phase-N with ADR); a defer-to-Phase-1 decision would itself be a lock and would have to be framed and recorded as one.
revision=“main” no-pinning (Q3 option D): simplest. Rejected — explicitly banned by ADR-008 and ADR-011 Guarantee 8 (no untracked methodology components).
ADR per HF SHA bump (Q3 option A): maximum auditability. Rejected for prototype as paperwork-on-an-irrelevant-axis; bumping HF SHAs is administrative not methodology-substantive; manifest-history captures the audit trail at lower cost; reserve ADR for schema-change bumps.
MPNet encoder for dedup (Q4 option C): higher-quality semantic embeddings. Rejected because 5x compute cost; near-duplicate detection quality differential is negligible (Reimers 2022); afterword names as upgrade path.
n-gram cosine dedup (Q4 option A): simplest and cheapest. Rejected because misses paraphrases plus surface-form variants — known weakness for prompt-injection corpora where attackers explicitly paraphrase to evade.
Hybrid n-gram plus MiniLM cascade (Q4 option D): high recall on duplicates. Rejected for prototype because two thresholds to calibrate plus more complex implementation; single-encoder simplicity preferred.
MiniLM-MPNet bake-off per SPEC_GREENFIELD default (Q4 option E): empirical evidence drives choice. Rejected because bake-off does not change the prototype-headline result; methodology-paperwork; afterword names as the upgrade path.
After-split dedup (Q5 option d): rejected by ADR-008 — already banned.
Cross-source-first then within-source (Q5 option b): rejected because complicates per-source determinism rule.
Pool-first single dedup pass (Q5 option c): rejected because loses per-source dedup-rate audit.
Open-budget no caps (Q6 option A): rejected because mosscap plus HackAPrompt would dominate training, so LODO would effectively measure intra-source rather than cross-source generalization; methodologically broken.
Smaller caps approximately 1K positives plus approximately 5K benigns (Q6 option D): faster prototype. Rejected because lower per-fold positive count widens AUROC CIs; reduces statistical power for paired-bootstrap rung-vs-rung; MDE may exceed reportable threshold.
Quality-filtered HackAPrompt subsample (Q6 option E): higher-quality positives. Rejected for prototype because selection bias — “successful” depends on target LLM era; afterword names as sensitivity-analysis extension.
Audit option A only (Q7; fold-pattern only): cheaper but suggestive-only. Rejected because the scope cross-check in option B is essentially free and produces direct evidence, so option B is preferred.
Audit option D (all three including C cross-source same-style ablation) per SPEC_GREENFIELD default: gold-standard. Rejected for prototype because option C requires per-attack-style sample size plus characterization of each reference scorer’s likely training-distribution; substantial additional training work; afterword names as the methodology upgrade.

References

LMSYS-Chat-1M (Zheng et al. 2023, ICLR 2024) — https://arxiv.org/abs/2309.11998
UltraChat (Ding et al. 2023, ICLR 2024) — https://arxiv.org/abs/2305.14233
HF Hub deepset/prompt-injections — https://huggingface.co/datasets/deepset/prompt-injections
HF Hub Lakera/gandalf_ignore_instructions — https://huggingface.co/datasets/Lakera/gandalf_ignore_instructions
HF Hub Lakera/mosscap_prompt_injection — https://huggingface.co/datasets/Lakera/mosscap_prompt_injection
HF Hub hackaprompt/hackaprompt-dataset — https://huggingface.co/datasets/hackaprompt/hackaprompt-dataset
HF Hub lmsys/lmsys-chat-1m — https://huggingface.co/datasets/lmsys/lmsys-chat-1m
HF Hub HuggingFaceH4/ultrachat_200k — https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k
HF Hub leolee99/NotInject — https://huggingface.co/datasets/leolee99/NotInject
HF Hub JailbreakBench/JBB-Behaviors — https://huggingface.co/datasets/JailbreakBench/JBB-Behaviors
Sentence-Transformers all-MiniLM-L6-v2 — https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
Reimers and Gurevych 2019 “Sentence-BERT” — https://arxiv.org/abs/1908.10084
Ethayarajh 2019 “How Contextual are Contextualized Word Representations?” (anisotropic embeddings) — https://arxiv.org/abs/1909.00512
InjecGuard NotInject (Li and Liu 2024) — https://arxiv.org/abs/2410.22770
HackAPrompt (Schulhoff et al. 2023, EMNLP) — https://arxiv.org/abs/2311.16119
Nadeau and Bengio 2003 “Inference for the Generalization Error” — https://link.springer.com/article/10.1023/A:1024068626366
ADR-005 (Principles 1 and 2 and 3 — methodology over metrics, honest evaluation preferred, structured limitations with extension conditions)
ADR-006 (3-seed floor preserved; paired-bootstrap protocol)
ADR-007 superseded by ADR-015 (rung architecture)
ADR-008 (data scope brief-level locks — public-only, hybrid splits, NotInject inclusion)
ADR-011 (methodology guarantees — leakage scan plus per-row predictions plus threshold-on-validation)
ADR-012 (engagement set preserved — HarmBench remains cite-and-acknowledge; afterword commits to upgrade)
ADR-014 (threat-model bundle — Q1 direct-primary plus indirect-zero-shot OOD shapes the BIPIA-as-zero-shot lock here)
ADR-015 (rung architecture refinement — ModernBERT-base only; references the 3 trained rungs at the 36-run total)

Transcript

See transcripts/2026-05-15__phase-0-02__data-design.md for the conversation that led to this decision.