Post-split cross-source leakage cleanup — drop train+val rows that exact-match or cosine-near-match test
ADR-043: Post-split cross-source leakage cleanup — drop train+val rows that exact-match or cosine-near-match test
Status
Accepted (2026-05-16).
Context
ADR-016 §Q3 locked a hard invariant: “no exact-hash and no high-cosine train-test overlap” via the leakage scan deliverable. ADR-016 §Q4 locked within-source dedup at MiniLM cosine 0.80. ADR-016 §Q5 locked cross-source dedup BUT only for benigns (LMSYS + UltraChat) with LMSYS-priority tiebreak. Cross-source dedup for POSITIVES was not specified because the LODO splits inherently treat each positive source as a unit: the held-out source goes entirely to test; the other 3 positive sources go to train+val. The implicit assumption was that source-level disjointness suffices.
Phase 1 Commit 5 pipeline run surfaced the empirical reality:
| LODO held-out source | Exact-hash overlaps | Cosine ≥0.85 overlaps |
|---|---|---|
| deepset (fold 0) | 0 | 2 per seed × 3 seeds = 6 |
| lakera_gandalf (fold 1) | 0 | 16 per seed × 3 = 48 |
| lakera_mosscap (fold 2) | 0 | 30 per seed × 3 = 90 |
| hackaprompt (fold 3) | 2 per seed × 3 = 6 | 7 per seed × 3 = 21 |
| Total across 12 splits | 6 | 165 |
The pattern is structural — cross-source positive near-paraphrases exist (especially between Lakera datasets and HackAPrompt, where successful injection patterns converged). LODO test of, say, mosscap, contains rows whose near-paraphrases sit in HackAPrompt’s train data — model has effectively “seen” the test pattern.
This is a methodology gap that needs closing before Phase 2 training begins (training on leaked data would inflate test scores).
Decision
Add a post-split leakage cleanup pass:
src/data/dedup.py::drop_train_test_leakage(train_val_df, test_df, threshold=0.85)— scans train+val vs test; drops train+val rows that:- Exact-match any test text, OR
- Have MiniLM cosine ≥ threshold (0.85) to any test row. Returns cleaned train+val + per-pair drop records (text snippets + cosines + reason).
src/data/splits.py::apply_leakage_cleanup(splits, threshold=0.85)— appliesdrop_train_test_leakageto each FoldSeedSplit; re-partitions cleaned train+val at the same 80/20 ratio thatmake_splitsused. Returns cleaned splits + per-split cleanup-record list.scripts/run_data_pipeline.pyinvokesapply_leakage_cleanupbetweenmake_splitsandmaterialize_splits. Pipeline log recordsn_droppedper (fold, seed) for audit.
Threshold 0.85 matches compute_leakage_report in src/data/audit.py (per ADR-016 Q3 hard-locked invariant). Test pool is preserved intact (test = held-out source’s full deduped pool).
ADR-016 Q5’s cross-source BENIGN dedup (LMSYS-priority tiebreak) is preserved unchanged — ADR-043 specifically adds POST-SPLIT cleanup for the train+val vs test text-overlap surface, not before-split cross-source positive dedup. The reason for post-split rather than before-split — LODO splits define the test pool per fold; cross-source positive dedup before splitting would require a notion of “which fold” the dedup applies to (different folds want different rows kept). Post-split per-fold cleanup is clearer.
Consequences
Positive:
- ADR-016 Q3 hard-locked leakage invariant satisfied — post-cleanup
leakage_clean = Trueinevals/leakage_report.json. - Methodology gap surfaced + closed in same commit; honest about the empirical finding.
- Cleanup records persist in pipeline log; reviewer can trace exactly which rows were dropped + why.
- Test pool unchanged — LODO held-out source full pool stays intact; only train+val rows that overlap with test are dropped.
- Cost is small — approximately 171 rows out of ~22,000 train+val pool per split (~0.8%).
Negative / cost:
- Train pool shrinks slightly (~3.6% of positive pool dropped across 12 splits; benigns unaffected because they aren’t in test per LODO design).
- Implicit super-session of ADR-016 Q5’s silence on cross-source positive overlap; documented here.
- Adds one O(N²) scan per split (already required by leakage scan); compute envelope unchanged.
Neutral:
- ADR-016 Q4 within-source dedup at 0.80 preserved unchanged.
- ADR-016 Q5 cross-source benign dedup at 0.80 with LMSYS-priority preserved unchanged.
- ADR-041 Q7 per-fold parquet materialization layout preserved unchanged (just fewer rows per train+val parquet).
Alternatives Considered
- Before-split cross-source positive dedup: rejected — cross-source positive dedup requires a notion of “which fold” since LODO assigns sources to test differently per fold. Post-split per-fold cleanup is clearer and yields the same end-state with less ambiguity.
- Tolerate the leakage with disclosure: rejected — ADR-016 Q3 says “no exact-hash and no high-cosine train-test overlap” is a HARD-LOCKED invariant. Tolerating ~0.25% per-split leakage would compromise the locked invariant.
- Drop EXACT-hash only; tolerate cosine: rejected — same hard-lock rationale. The cosine threshold (0.85) is the same as the leakage scan threshold; consistency matters.
- Higher cleanup threshold (0.90): rejected — would leave the 0.85-0.90 band of near-paraphrases as leaked. The 0.85 threshold is the locked leakage threshold per ADR-016 Q3.
- Drop test rows instead of train rows: rejected — test = held-out source’s full pool per LODO Q1; dropping test rows would shrink the test set and violate LODO’s “test on held-out source” definition. Dropping train rows preserves LODO test integrity.
References
- ADR-016 (data design bundle — Q3 hard-locked leakage invariant + Q4 within-source dedup + Q5 cross-source benign dedup)
- ADR-041 (Phase 1 implementation bundle — Q7 per-fold parquet materialization)
- ADR-042 (LLM-pre-label dedup-holdout bootstrap)
Transcript
See transcripts/2026-05-16__phase-1-implementation.md for the conversation that led to this decision (post-Commit-5 pipeline empirical finding).