Phase 1 (Data) implementation bundle — manifest schema, SHA pinning, validator placement, loader arch, holdout sourcing, contamination corpus, output materialization
ADR-041: Phase 1 (Data) implementation bundle — manifest schema, SHA pinning, validator placement, loader arch, holdout sourcing, contamination corpus, output materialization
Status
Accepted (2026-05-16).
Context
ADR-016 locked the §1 Data methodology bundle (source slate plus splits plus HF pinning plus dedup encoder plus ordering plus ceilings plus reference-scorer audit) at Phase 0-02. ADR-026 locked the §5 Code architecture module-layout taxonomy at Phase 0-06. Both ADRs deferred several implementation specifics to Phase 1 entry:
- ADR-016 §Q3 said “HF dataset revisions plus GitHub commit SHAs pinned at Phase 1 entry in unified data/source_manifest.yaml” but did not specify the manifest schema shape, SHA discovery method, or revision-bump workflow.
- ADR-016 §Q4 said “50-pair labeled holdout” for dedup calibration but did not specify the holdout sampling strategy or labeling method.
- ADR-026 §Decision tree listed loaders.py + dedup.py + splits.py with a comment line
# source loaders, dedup, LODO splits, manifest validation, leaving manifest_validation.py placement implied-but-not-explicit. - Assumption A-006 (contamination scan) said “known public training-data mirrors” but did not specify the reference corpus.
A post-Phase-0 /exploring-options Phase-1 walk surfaced 7 implementation decision points, posed them with options + pros/cons + recommendation, and the user ratified the 7 recommendations as a bundle. This ADR documents the 7 locks.
Sub-decisions Q1 plus Q2 plus Q3 plus Q4 plus Q7 are implementation specifics that flow directly from ADR-016 + ADR-026 deferred-to-Phase-1 surfaces — they do not introduce new methodology beyond what ADR-016 + ADR-026 imply.
Sub-decisions Q5 plus Q6 are methodology refinements — Q5 specifies the holdout sampling strategy beyond ADR-016 §Q4’s “50-pair labeled holdout”; Q6 operationalizes A-006’s “known public training-data mirrors” as a specific (slate plus templates) corpus blend. These two sub-decisions warrant ADR documentation under the project anti-pattern rule against “Adding a methodology component without an ADR.”
Decision
Q1 — Manifest schema shape (rich = option B)
data/source_manifest.yaml carries the rich schema at schema_version: "1.0". Top-level keys are schema_version plus generated_at_utc plus pin_method plus adr_ref plus bump_history plus sources. The sources list contains 11 rows, each with 13 fields:
| Field | Type | Purpose |
|---|---|---|
name |
str | Internal source key (snake_case; e.g., lmsys_chat_1m) |
hf_id |
str | Upstream identifier (HF repo or GitHub repo path) |
type |
str | hf_dataset or git_repo |
revision_sha |
str | Pinned commit/revision SHA (40-char preferred) |
license |
str | License identifier (e.g., CC-BY-4.0) |
role |
str | train_positive, train_benign, or ood_eval |
expected_n |
int | Post-dedup row-count estimate from dossier |
cap |
int or None | Subsample ceiling per ADR-016 Q6 (None = use all) |
selection_seed |
int or None | Seed for random subsample (None if cap is None) |
language_filter |
str or None | ISO 639-1 code (e.g., en for LMSYS-Chat-1M) |
subset |
str or None | HF dataset config/subset (e.g., behaviors for JBB) |
split |
str or None | HF split name (e.g., train, train_sft) |
citation_arxiv |
str or None | arXiv ID for provenance (e.g., 2309.11998) |
Validator src/data/manifest_validation.py enforces presence + types + role-count mismatch + slate-completeness — validate_manifest(path) returns the parsed dict on success or raises ManifestSchemaError with field-level context.
Q2 — Live-fetch SHAs (option A)
scripts/pin_source_manifest.py is the one-time + bump-driven entry point. HF datasets resolve via huggingface_hub.HfApi.dataset_info(repo_id).sha; GitHub-cloneable sources (xstest, BIPIA, InjecAgent) resolve via subprocess.run(["git", "ls-remote", url, "HEAD"]). HF_TOKEN auto-discovers from environment (per ADR-035) for gated datasets (lmsys-chat-1m).
Re-runs are idempotent — fresh SHAs vs prior manifest SHAs are compared; identical = no-op; mismatch = SHAMismatchError unless --force flag records a bump_history entry. bump_history schema per entry: {bumped_at_utc, trigger, diffs: [{name, prior, fresh}]}.
Q3 — manifest_validation.py placement (option A)
The validator lives at src/data/manifest_validation.py. ADR-026 §Decision tree’s comment line for src/data/ reads # source loaders, dedup, LODO splits, manifest validation — manifest validation is explicitly listed as a src/data/ concern. The 3 files named under the tree (loaders.py, dedup.py, splits.py) are illustrative not exhaustive. No ADR-026 supersession needed; adding manifest_validation.py is consistent with the locked taxonomy.
Q4 — Loader architecture (option D — HF dispatch + per-source normalize family)
src/data/loaders.py exposes a single load_source(name: str) -> pandas.DataFrame dispatch function. The function reads data/source_manifest.yaml to resolve (hf_id, type, revision_sha, split, subset, language_filter, cap, selection_seed) per source; calls datasets.load_dataset(hf_id, name=subset, split=split, revision=revision_sha) for HF datasets; clones + reads + parses CSV/JSON for GitHub sources; dispatches to a private _normalize_<source>(ds) helper for column normalization + label encoding (each source has bespoke column names — deepset uses text/label; mosscap uses prompt/tag; etc.).
HF default cache (~/.cache/huggingface/datasets/) handles repeat-fetch acceleration. For RunPod-pod consumption, HF_HOME may point at a persistent volume to survive cold-start re-fetches; for laptop usage, the default cache suffices.
The English-only language filter for lmsys_chat_1m is applied inside _normalize_lmsys_chat_1m via the language column (post-Phase-1 audit may surface that LMSYS-Chat-1M has language tags or requires fasttext-langdetect — choice documented in decisions/library_imports.md).
Q5 — Stratified-cosine-band holdout (option A)
The 50-pair dedup calibration holdout is built via stratified-cosine-band sampling:
- Compute pairwise MiniLM-L6-v2 cosines within each of the 4 train-positive sources (deepset, lakera-gandalf, mosscap, hackaprompt).
- From each source’s within-source pair distribution, sample candidate pairs in 5 cosine bands:
[0.95, 1.0),[0.85, 0.95),[0.75, 0.85),[0.65, 0.75),[0.55, 0.65). - Take 5 candidate pairs from each band (25 pairs total across the bands) and another 5 random pairs uniformly across the range — round-robin across sources to spread coverage; total 25 + 25 = 50.
- Brandon labels each pair by visual inspection —
true_duplicate: bool(semantic-near-duplicate or surface-form-near-duplicate per ADR-016 Q4 dedup spec). - Persist to
data/dedup_holdout.jsonl(gitignored — contains source content) plus a SHA-256 digest of the JSONL bytes recorded inevals/dedup_calibration.jsonfor tamper-evident provenance.
The 5-band design ensures the FPR + FNR measurement at the locked 0.80 threshold (per ADR-016 §Q4) actually probes the decision boundary — without banding, the random sample would be dominated by the low-cosine mass and report uninformative FPR/FNR ≈ 0 / ≈ 0.
Q6 — Contamination-scan reference corpus (option B — slate + ~200 templates)
evals/contamination_scan.json checks per-eval-row max-cosine to a two-component reference corpus:
- Cross-source slate: each OOD eval row’s max-cosine across the 4 train-positive source pools (post-dedup). Catches the realistic threat — did this OOD row appear (or near-paraphrase appear) in a train-positive pool?
- Extracted injection templates (approximately 200): templates extracted from HackAPrompt success-pattern metadata — Schulhoff 2023 catalogues 29 prompt-injection attack techniques; for each technique we extract approximately 7 templates from observed-success patterns. Templates capture the canonical injection-template space; catches near-paraphrase contamination with the broader injection-template distribution.
Per-row scan output: {eval_source, eval_row_idx, max_cosine_slate, max_cosine_templates, max_cosine_overall, flagged: max_cosine_overall >= 0.85}.
Generic-web-text mirrors (The Pile, C4, RedPajama) are deferred to afterword. ADR-016 compute envelope (approximately $60-115 total) did not price in The-Pile-sized scans; the realistic threat model for prompt-injection eval contamination is template-level overlap not generic-web-text overlap. If reviewer pushes back, point to the future-work spoke + the afterword commitment.
Q7 — Output materialization + Makefile execution surface (option A)
Post-dedup + post-split outputs materialize as per-fold parquet:
data/processed/
├── fold-0/seed-42/{train,val,test}.parquet
├── fold-0/seed-43/{train,val,test}.parquet
├── fold-0/seed-44/{train,val,test}.parquet
├── fold-1/...
├── ... (48 parquet files total — 4 folds × 3 seeds × 3 roles)
└── index_masks/
└── fold-N__seed-S__{train,val,test}.npy
Each parquet carries columns (text, label, source, row_idx_in_source). Index masks (numpy.ndarray[int64]) persist the source-row indices for reverse-trace at audit time (e.g., evals/leakage_report.json cites specific row pairs that would-have-been-leaks-pre-dedup).
Makefile gains 5 granular Phase 1 targets plus 1 umbrella:
make data-fetch— runspin_source_manifestthen materializes raw per-source rows todata/raw/<source>.parquet(gitignored).make data-dedup— runs MiniLM dedup + cross-source LMSYS-priority pass; materializes deduped pool todata/processed/deduped_pool.parquet.make data-splits— runs LODO × seeds × stratified 80/20; materializes 48 per-fold parquet files.make data-audit— runs counts + class-balance + leakage + contamination scans; writesevals/{data_audit,leakage_report,contamination_scan,dedup_calibration}.json.make data-prepare— umbrella that runs all four in order; idempotent; outputs are timestamp-checked.
Consequences
Positive:
- Manifest is grep-able and reviewer-readable — every per-source decision (cap, seed, language filter, split, citation) is visible in one file.
- SHA pinning at Phase 1 entry plus idempotent re-runs plus explicit
bump_historyworkflow gives a tamper-evident audit trail; ADR-036 bump-trigger policy operationalizes. - Manifest validator + invariant test
test_source_manifest_schema_validenforces the rich schema at CI — no silent drift in manifest shape. - Single-file loader (
loaders.pywith_normalize_*helpers) keeps per-source quirk-handling local without splitting into 11 files. - Stratified-cosine-band holdout produces an actually-informative FPR + FNR at the 0.80 threshold (vs random sample which would be uninformative).
- Contamination corpus blend (slate + templates) covers the realistic threat (template-level overlap) at proportionate cost; generic-pretraining-mirror extension is named for future work.
- Per-fold materialized parquet is the shape ADR-015 + ADR-017 trainer consumes directly; granular Makefile targets enable partial rebuilds during Phase 1 debugging.
Negative / cost:
- Live-fetching SHAs requires network at Phase 1 entry; HF_TOKEN required for gated datasets (lmsys-chat-1m may require accepting the dataset card; document in
decisions/library_imports.md). - 48 parquet files plus dedup pool plus raw cache approximately 500 MB on disk under
data/; all gitignored. - Stratified-cosine-band labeling consumes approximately 1-2 hours of researcher time (Brandon hand-labels); offset by producing a defensible FPR + FNR measurement.
- Template-extraction from HackAPrompt success-pattern metadata requires processing HackAPrompt-success-rows; one-time cost approximately 30 min; templates persist as
data/contamination_templates.parquet(gitignored). - Manifest_validation.py adds a 4th file to
src/data/beyond ADR-026 §Decision tree’s explicit 3 — defensible per the §Decision tree comment line listing manifest validation as a concern; documented in this ADR’s §Q3.
Neutral:
- ADR-016 + ADR-026 locks preserved unchanged; ADR-041 fills in deferred implementation specifics. No supersession.
- Six new Phase 1 deps added across commits 1-4 (huggingface_hub, datasets, sentence-transformers, pandas, pyarrow, scikit-learn); declared in
pyproject.toml[project.dependencies]; locked viauv.lockat submission tag. - Three invariant test stubs unskip across commits 1 + 3 + 5 — schema-valid, dedup-calibration, contamination-clean.
Alternatives Considered
- Q1 option A (minimal 5-field schema): rejected — loses provenance (citation, dossier reference) and ADR-016 Q6 cap rationale visibility.
- Q1 option C (two-tier inline minimal + sidecar detail): rejected — two files to keep in sync;
bump_historyharder to maintain across files. - Q2 option B (read from dossier MANIFEST.json): rejected — dossier doesn’t record every source SHA; staleness risk.
- Q2 option C (pin to “main” tag): explicitly banned by ADR-016 §Q3 + ADR-011 Guarantee 8.
- Q3 option B (manifest_validation in src/utils/): rejected — schism; manifest is a data concern; loaders.py would import from utils/.
- Q3 option C (scripts/validate_manifest.py only): rejected — no reusable validator class; tests cannot import.
- Q3 option D (inline inside loaders.py): rejected — couples loading with validation; harder to test isolated.
- Q4 option A (one load_source dispatch, no normalize family): rejected — each HF dataset has bespoke column shapes; central dispatch without per-source helpers would balloon into a giant if/elif tree inside one function.
- Q4 option B (separate parquet materialization for raw): rejected — raw pre-dedup is throwaway; only post-dedup pool plus per-fold splits warrant materialization.
- Q4 option C (per-source classes): rejected — more boilerplate; ADR-026 guideline (≤50 LoC per module) tighter as a class-per-source pattern.
- Q5 option B (random 50 pairs): rejected — random sample dominated by low-cosine pairs; uninformative FPR/FNR around the 0.80 threshold.
- Q5 option C (adversarial paraphrase-generator pairs): rejected — synthetic positives are not in-distribution; calibration may be optimistic.
- Q5 option D (LLM-judge labels 100 pairs + spot-check 20): rejected — judge prior may contaminate calibration; near-duplicate task quality on judges is unknown.
- Q6 option A (slate only): rejected — doesn’t catch near-paraphrase contamination with the broader injection-template distribution.
- Q6 option C (slate + templates + The Pile / C4 / RedPajama sample): rejected for prototype — compute envelope (approximately $60-115) didn’t price in The-Pile-sized scans; deferred to afterword.
- Q6 option D (slate-only minimal): rejected — under-specifies A-006’s “known public training-data mirrors” wording.
- Q7 option B (single deduped store + index masks only): rejected — trainer has to apply masks at load; harder for reviewer to spot-check a single fold.
- Q7 option C (per-fold parquet + single umbrella Make target only): rejected — slower Phase 1 debugging iteration; can’t partially rebuild.
References
- ADR-016 (data design bundle — locks methodology surface)
- ADR-026 (module layout — locks
src/data/taxonomy including “manifest validation” comment) - ADR-036 (library version pins + bump-trigger policy)
- ADR-040 (Phase 0 audit findings + assumption backfill — A-006 contamination assumption)
- huggingface_hub HfApi docs — https://huggingface.co/docs/huggingface_hub/main/en/package_reference/hf_api
- HF datasets load_dataset docs — https://huggingface.co/docs/datasets/main/en/package_reference/loading_methods
- Sentence-Transformers pretrained models — https://www.sbert.net/docs/pretrained_models.html
- Reimers and Gurevych 2019 “Sentence-BERT” — https://arxiv.org/abs/1908.10084
- HackAPrompt (Schulhoff et al. 2023, EMNLP) — https://arxiv.org/abs/2311.16119
Transcript
See transcripts/2026-05-16__phase-1-implementation.md for the conversation that led to this decision (post-Phase-0-compaction /exploring-options Phase-1 walk).