# Leakage

> **Background** *(skip if you've internalized this)*. Leakage is when an
> evaluation procedure lets information about the test set influence the
> model: directly (a row appears in both train and test), indirectly
> (correlated rows; same patient in both folds), or structurally (the
> model's feature pipeline saw test data during fitting). It inflates
> reported metrics by amounts that often exceed the gap between papers.
> Kapoor & Narayanan ([2023](https://arxiv.org/abs/2207.07048)) document
> 294 papers across 17 fields where leakage was the cause of
> non-replication. Treat leakage detection as a *first-class* eval step,
> not a one-off audit.

This chapter is a taxonomy of leakage classes, the eval-toolkit
primitive that catches each one, and the pitfalls that make leakage hard
to spot. Every code block runs under [Sybil](https://sybil.readthedocs.io/);
copy them as starting points.

## Setup

```python
import pandas as pd

from eval_toolkit import EvalSlice, run_leakage_checks
```

We'll reuse two small split fixtures throughout:

```python
clean_train = pd.DataFrame(
    {"text": [f"unique_train_{i}" for i in range(20)],
     "label": [i % 2 for i in range(20)]}
)
clean_test = pd.DataFrame(
    {"text": [f"unique_test_{i}" for i in range(10)],
     "label": [i % 2 for i in range(10)]}
)
clean_splits = {
    "train": EvalSlice(name="train", df=clean_train),
    "test": EvalSlice(name="test", df=clean_test),
}
```

## Taxonomy

The eight classes below cover what consumers of this toolkit will hit in
binary-classification work. Subclasses (e.g., temporal, group,
transfer-learning) are grouped under the parent that catches them.


| # | Class | Toolkit primitive | Severity default |
|---|---|---|---|
| 1 | Exact duplicate | `ExactDuplicateCheck` | warning |
| 2 | Near duplicate | `NearDuplicateCheck` | warning |
| 3 | Encoding-obfuscated duplicate | `NormalizedFormLeakageCheck` | error |
| 4 | Cross-split (train↔eval) | `CrossSplitLeakageCheck` | error |
| 5 | Label conflict | `LabelConflictCheck` | error |
| 6 | Group leakage | `GroupLeakageCheck` | error |
| 7 | Temporal leakage | `TemporalLeakageCheck` | error |
| 8 | Target / pretraining / distribution shift | (out of toolkit scope) | n/a |

(exact-duplicates)=
### 1. Exact duplicates

**Definition.** Two rows whose normalized text is identical (same string
after Unicode-NFC + whitespace collapse + casefold).

**Harm.** Within-split duplicates inflate observed metrics by the amount
of *redundancy* in your data: a duplicated rare positive double-counts in
the recall numerator. Across-split duplicates are a bigger issue; see
class 4.

**Primitive.** [`ExactDuplicateCheck`](../api/leakage.md) wraps
`text_dedup.ExactNormalizedHashStrategy` (whitespace-normalized SHA-256
buckets). Default severity is `"warning"` because exact dupes within a
split are common in real corpora; opt into `"error"` for strict mode.

```python
from eval_toolkit import ExactDuplicateCheck

dup_train = pd.DataFrame(
    {"text": ["hello world", "hello world", "unique"], "label": [0, 0, 1]}
)
dup_splits = {"train": EvalSlice(name="train", df=dup_train)}
finding = ExactDuplicateCheck().validate(dup_splits)
print(f"severity={finding.severity} n_affected={finding.n_affected}")
print(f"drop_indices={finding.drop_indices}")
```

(near-duplicates)=
### 2. Near duplicates

**Definition.** Pairs of rows whose textual similarity exceeds a
threshold under some sense: TF-IDF cosine, embedding cosine, MinHash
Jaccard.

**Harm.** Same as exact duplicates but with a tunable strictness.
Common sources: paraphrased social-media posts, scraped duplicates with
minor HTML differences, multiple translations of the same source.

**Primitive.** [`NearDuplicateCheck`](../api/leakage.md) takes a
`strategy: SimilarityStrategy` and a `threshold`; the strategy is
pluggable so the *sense of similarity* is opt-in. Defaults to
`TfidfCosineStrategy` at 0.9.

```python
from eval_toolkit import NearDuplicateCheck

paraphrases = pd.DataFrame(
    {"text": [
        "the quick brown fox jumps",
        "the quick brown fox jumps!",
        "lorem ipsum dolor sit amet",
    ],
     "label": [0, 0, 1]}
)
splits = {"train": EvalSlice(name="train", df=paraphrases)}
finding = NearDuplicateCheck(threshold=0.8).validate(splits)
print(f"caught {finding.n_affected} near-dupe rows")
```

For semantic dedup, swap the strategy:

```python
# from eval_toolkit import EmbeddingCosineStrategy
# strategy = EmbeddingCosineStrategy(embedder=my_sentence_transformer)
# NearDuplicateCheck(strategy=strategy, threshold=0.85)
```

(encoding-obfuscation)=
### 3. Encoding-obfuscated duplicates (NEW in v0.7.0)

**Definition.** Rows that *look different* but normalize to the same
text under aggressive Unicode transforms: NFKC + zero-width-strip +
Symbol-Other-strip (most emoji) + casefold.

**Harm.** This is the **dominant unfixed leakage class in
prompt-injection corpora**. The PI_HackAPrompt_SQuAD work (2025) reports
that encoding-obfuscated duplicates are detected at a rate of only
**21.3 %** under naive dedup but achieve a **76.2 %** attack success rate
against deployed classifiers. If your eval set has zero-width-padded
copies of your train set, you are measuring nothing.

**Primitive.** [`NormalizedFormLeakageCheck`](../api/leakage.md) applies
the aggressive normalization before hashing. Default severity is
`"error"`: encoding-obfuscated overlap is dangerous enough that you
should opt *out* (to `"warning"`) only when upstream cleaning already
handles it.

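
To make the normalization steps concrete, here is a minimal standard-library sketch of "normalize aggressively, then hash". It is an illustration of the idea, not the toolkit's implementation: the helper names `aggressive_normalize` and `normalized_digest` are hypothetical, the zero-width set is an illustrative subset, and the whitespace collapse is an added assumption.

```python
import hashlib
import unicodedata

# Illustrative subset of zero-width / invisible characters (an assumption,
# not the toolkit's actual list).
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def aggressive_normalize(text: str) -> str:
    """NFKC, drop zero-width chars and Symbol-Other (So), collapse spaces, casefold."""
    text = unicodedata.normalize("NFKC", text)
    kept = [
        ch for ch in text
        if ch not in ZERO_WIDTH and unicodedata.category(ch) != "So"
    ]
    # Whitespace collapse is an extra assumption so that removing an emoji
    # does not leave a double space behind.
    return " ".join("".join(kept).split()).casefold()


def normalized_digest(text: str) -> str:
    """SHA-256 of the aggressively normalized text (hypothetical helper)."""
    return hashlib.sha256(aggressive_normalize(text).encode("utf-8")).hexdigest()


plain = "hello world"
obfuscated = "h\u200be\u200bllo \U0001F600 WORLD"  # zero-width spaces, emoji, case change

print(plain == obfuscated)                                        # raw strings differ
print(normalized_digest(plain) == normalized_digest(obfuscated))  # digests collide
```

Rows whose digests collide would be treated as duplicates of each other. The toolkit's own check is invoked like this:
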
```python
from eval_toolkit import NormalizedFormLeakageCheck

# Same string, with zero-width spaces (U+200B) injected; written as escapes
# so the invisible characters survive copy-paste.
obfuscated = pd.DataFrame(
    {"text": ["hello world", "h\u200be\u200bllo world", "unrelated"],
     "label": [0, 1, 0]}
)
splits = {"test": EvalSlice(name="test", df=obfuscated)}
finding = NormalizedFormLeakageCheck().validate(splits)
print(f"caught {finding.n_affected} obfuscation collision(s)")
```

> **What NOT to do.** Don't rely on `text.lower().strip()` for
> prompt-injection eval. The class of attacks documented in OWASP
> [LLM01:2025](https://genai.owasp.org/llmrisk/llm01-prompt-injection/)
> includes zero-width injection, emoji-padded payloads, and alt-encoding,
> all of which are defeated only by the aggressive normalization above
> (NFKC + zero-width strip + Symbol-Other strip).

(cross-split)=
### 4. Cross-split (train ↔ eval) leakage

**Definition.** A row in your test slice that is identical or
near-identical to a row in your training slice.

**Harm.** The canonical leakage. Your model has memorized the test
point; your "test metric" is a memorization metric. Inflations of
5–30 % PR-AUC have been reported on text classification benchmarks where
dedup was skipped.

**Primitive.** [`CrossSplitLeakageCheck`](../api/leakage.md) wraps
`text_dedup.cross_dedup`. Default severity is `"error"`: this is the
genuinely dangerous one.

```python
from eval_toolkit import CrossSplitLeakageCheck

train = pd.DataFrame({"text": ["hello world a longer string", "lorem"], "label": [0, 1]})
test = pd.DataFrame({"text": ["hello world a longer string", "different"], "label": [1, 0]})
splits = {
    "train": EvalSlice(name="train", df=train),
    "test": EvalSlice(name="test", df=test),
}
finding = CrossSplitLeakageCheck(train_split="train").validate(splits)
print(f"{finding.n_affected} test row(s) leak from train")
```

(label-conflict)=
### 5. Label conflict

**Definition.** The same (or near-same) text appearing with *different*
labels across slices, or across the source datasets you concatenated to
build a slice.


**Harm.** Annotator disagreement aside, label conflicts mean your eval
is ambiguous. The same prompt being labeled "injection" in train and
"benign" in test poisons the evaluation regardless of which label is
"correct".

**Primitive.** [`LabelConflictCheck`](../api/leakage.md). Replaces the
cross-source conflict resolution that `prompt-injection-sdd` and
`prompt_injection_detector` reimplement (~50 LOC each).

```python
from eval_toolkit import LabelConflictCheck

train = pd.DataFrame({"text": ["x", "y"], "label": [0, 1]})
test = pd.DataFrame({"text": ["x", "z"], "label": [1, 0]})  # conflict on "x"
splits = {
    "train": EvalSlice(name="train", df=train),
    "test": EvalSlice(name="test", df=test),
}
finding = LabelConflictCheck().validate(splits)
print(f"{finding.n_affected} conflicting rows across {len(splits)} splits")
```

(group-leakage)=
### 6. Group leakage

**Definition.** A grouping unit (patient, user, document, source) that
appears in more than one split.

**Harm.** Within-group correlation makes the test metric a *memorization
of the group*, not a generalization measure. The classic medical-imaging
failure: same patient's images in train and test → near-100 % accuracy
that vanishes on truly held-out patients.

**Primitive.** [`GroupLeakageCheck`](../api/leakage.md) takes a
`group_col`. Severity `"error"`. Use with
[`GroupKFoldSplitter`](../api/splits.md) or
[`SourceDisjointKFoldSplitter`](../api/splits.md); see
[splits.md](splits.md).

```python
from eval_toolkit import GroupLeakageCheck

train = pd.DataFrame(
    {"text": ["a", "b", "c"], "label": [0, 1, 0], "group_id": [1, 2, 3]}
)
test = pd.DataFrame(
    {"text": ["d", "e"], "label": [1, 0], "group_id": [1, 4]}  # group 1 spans both splits
)
splits = {
    "train": EvalSlice(name="train", df=train),
    "test": EvalSlice(name="test", df=test),
}
finding = GroupLeakageCheck(group_col="group_id").validate(splits)
print(f"groups spanning splits: {sorted(finding.evidence['violating_groups'].keys())}")
```

(temporal-leakage)=
### 7. Temporal leakage

**Definition.** Train data with timestamps later than test data: the
model is allowed to "see the future" during training.

**Harm.** Temporal correlation: features are non-stationary, label
distributions drift, the test metric measures interpolation rather than
forecasting. Recently formalized by Yan et al.
([Hidden Leaks in Time Series Forecasting, 2025](https://arxiv.org/html/2512.06932v1)),
who catalog LSTM rolling-window leakage and validation-strategy leakage
as distinct subclasses.

**Primitive.** [`TemporalLeakageCheck`](../api/leakage.md): given a
`time_col` and `split_order`, asserts every earlier split's `max(time)`
≤ next split's `min(time)`.

```python
from eval_toolkit import TemporalLeakageCheck

train = pd.DataFrame({"text": ["a", "b"], "label": [0, 1], "t": [10, 20]})
test = pd.DataFrame({"text": ["c", "d"], "label": [1, 0], "t": [5, 15]})  # bad: test starts before train ends
splits = {
    "train": EvalSlice(name="train", df=train),
    "test": EvalSlice(name="test", df=test),
}
finding = TemporalLeakageCheck(
    time_col="t", split_order=("train", "test")
).validate(splits)
print(f"violations={len(finding.evidence['violations'])}")
```

(target-leakage)=
### 8. Target / pretraining / distribution-shift leakage

**Definition (umbrella).** Three related classes that the toolkit *does
not* try to detect automatically:

- **Target leakage**: a feature trivially predicts the label because
  it's computed *after* the label is determined (e.g., "user clicked
  'unsubscribe'" as a feature for a churn label).
- **Transfer-learning / pretraining-corpus leakage**: a pretrained model
  has already seen part of your eval set during pretraining (Don't Push
  the Button, [Pellizzoni et al. 2025](https://link.springer.com/article/10.1007/s10462-025-11326-3)).
  Common with HuggingFace pretrained transformers and public benchmarks.
- **Distribution shift**: the production population differs from your
  eval population in ways that change the metric's interpretation
  (covariate shift, label shift, concept drift; Recht et al. 2019).

**Why the toolkit doesn't catch these.** Target leakage requires
domain-specific feature semantics. Pretraining leakage requires access
to the pretraining corpus. Distribution shift requires the production
distribution. All three are *consumer-side* concerns, not eval-toolkit
concerns. The pointers above are required reading.

(pytorch-pitfalls)=
## PyTorch & transformer-specific pitfalls

### Tokenization-level duplicates

Two strings can look distinct in plain text but tokenize identically once
a transformer's BPE / SentencePiece / WordPiece tokenizer is applied.
Don't dedup on raw text only when your downstream model is a
transformer; dedup on the tokenizer's output too.

Since v0.37.0 the toolkit ships `TokenizationLeakageCheck` for this
directly. It takes a tokenizer callable (any HuggingFace
`PreTrainedTokenizerBase`, or any `Callable[[str], Mapping]` returning
HF-style `{"input_ids": [...]}` output) and dedups on the `input_ids`
tuple per row:

```text
from transformers import AutoTokenizer
from eval_toolkit.leakage import TokenizationLeakageCheck

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
check = TokenizationLeakageCheck(tokenizer=tokenizer)
finding = check.validate(splits)
```

Optional install via the `[transformers]` extra (intentionally **not**
in `[all]` / `[dev]`: transformers transitively pulls torch (~700MB),
per the `[embeddings]` precedent). The check itself does not import
transformers; consumers pass an already-instantiated tokenizer.

**Pin the tokenizer for audit reproducibility.** Different
`transformers` releases can emit different `input_ids` for the same text
(added-vocab changes between minors), which would silently flip dedup
outcomes. Capture both the package version and a SHA-256 of the
`tokenizer.json` in your `RunManifest`.

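
One way to collect those two values, sketched with the standard library only. The helper names (`sha256_of_file`, `installed_version`, `tokenizer_pin`) and the shape of the returned dict are hypothetical; how the result is attached to a `RunManifest` is not shown because that API is not described here. The demo hashes a stand-in file; in practice the path would point at a real `tokenizer.json`, for example one written by the tokenizer's `save_pretrained(...)`.

```python
import hashlib
import json
import tempfile
from importlib import metadata
from pathlib import Path
from typing import Optional


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large files are not loaded at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def tokenizer_pin(tokenizer_json: Path, package: str = "transformers") -> dict:
    """Hypothetical helper: the two facts worth recording for an audit."""
    return {
        "package": package,
        "package_version": installed_version(package),
        "tokenizer_json_sha256": sha256_of_file(tokenizer_json),
    }


# Demo with a stand-in file; replace with the path to your real tokenizer.json.
with tempfile.TemporaryDirectory() as tmp:
    stand_in = Path(tmp) / "tokenizer.json"
    stand_in.write_text(json.dumps({"version": "1.0"}), encoding="utf-8")
    pin = tokenizer_pin(stand_in)

print(json.dumps(pin, indent=2))
```

Comparing the stored digest against a fresh one at audit time shows whether the tokenizer file changed between runs, independently of the package version string.
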

For consumers who want to avoid the `[transformers]` extra entirely:
emulate the check by adding a `tokens` column to your dataframe and
pointing `ExactDuplicateCheck` at it via a custom feature view on the
`EvalSlice`. The `TokenizationLeakageCheck` is just sugar over that
pattern with explicit tokenizer-pin guidance.

### Pretraining contamination

Public benchmarks (GLUE, SuperGLUE, MMLU, even niche prompt-injection
corpora) often appear verbatim in pretraining corpora like Common Crawl.
A "9X.X % accuracy" on a public benchmark with a public-pretrained
backbone is rarely a generalization claim. Two mitigations:

1. **Audit publication dates.** A model pretrained in 2024-Q3 may have
   seen anything published before then. Use eval sets curated *after*
   your model's pretraining cutoff.
2. **Use embedding-space dedup.**
   [`EmbeddingCosineStrategy`](../api/text_dedup.md) wrapped in a
   `NearDuplicateCheck` lets you flag eval rows that embed-near any
   train row, including rough paraphrases. This doesn't catch
   pretraining contamination but *does* catch your own train-set
   contamination of an eval set.

(leakage-pitfalls)=
## Pitfalls / Common mistakes

- **Running checks AFTER training.** By then the harm is done. Run
  leakage checks at *load time*, before any model fits anything. The
  recommended pattern is to pass `leakage_checks=[...]` to
  [`evaluate(...)`](../api/harness.md), which fails the run on
  `error`-severity findings before scoring starts.
- **Trusting per-source dedup as cross-source dedup.** Two corpora that
  each look clean can still leak across each other. Always run
  `CrossSplitLeakageCheck` between every (train, eval) pair, not just
  within each.
- **Picking thresholds without understanding the strategy.** A
  `TfidfCosineStrategy` threshold of 0.9 is *not* the same fraction as a
  `JaccardNgramStrategy` threshold of 0.9. Read the strategy's docstring
  before tuning.
- **Conflating `severity="warning"` with "ignore"**.
  Warnings still appear in the manifest's `leakage_report`; downstream
  auditors will see them. They don't gate the run, but they're not
  invisible.
- **Believing CV alone gives an OOD claim.** K-fold CV with random
  partitioning measures interpolation across your sample, not
  generalization to a new population. For OOD claims, combine
  [`SourceDisjointKFoldSplitter`](../api/splits.md) +
  `CrossSplitLeakageCheck` + a held-out *final* test set never used in
  development.

## Putting it all together

```python
from eval_toolkit import (
    EvalSlice,
    ExactDuplicateCheck,
    NormalizedFormLeakageCheck,
    LabelConflictCheck,
    CrossSplitLeakageCheck,
    run_leakage_checks,
)

train = pd.DataFrame({"text": ["clean a", "clean b"], "label": [0, 1]})
test = pd.DataFrame({"text": ["clean c", "clean d"], "label": [1, 0]})
splits = {
    "train": EvalSlice(name="train", df=train),
    "test": EvalSlice(name="test", df=test),
}
report = run_leakage_checks(
    [
        ExactDuplicateCheck(),                        # warning
        NormalizedFormLeakageCheck(),                 # error
        LabelConflictCheck(),                         # error
        CrossSplitLeakageCheck(train_split="train"),  # error
    ],
    splits,
)
print(f"clean splits: has_errors={report.has_errors()}")
```

## Further reading

- Kapoor, S. & Narayanan, A. *Leakage and the reproducibility crisis in
  machine-learning-based science.* Patterns 4(9), 2023.
  [arXiv:2207.07048](https://arxiv.org/abs/2207.07048)
- Pellizzoni, S. et al. *Don't push the button! Data leakage risks in ML
  and transfer learning.* Springer AI Review, 2025.
  [DOI](https://link.springer.com/article/10.1007/s10462-025-11326-3)
- Yan, X. et al. *Hidden Leaks in Time Series Forecasting.* arXiv 2025.
  [arXiv:2512.06932](https://arxiv.org/html/2512.06932v1)
- OWASP. *LLM01:2025 Prompt Injection.*
  [genai.owasp.org](https://genai.owasp.org/llmrisk/llm01-prompt-injection/)
- PI_HackAPrompt_SQuAD analysis (2025).
  [arXiv:2505.04806](https://arxiv.org/html/2505.04806v1)
- Recht, B. et al. *Do ImageNet classifiers generalize to ImageNet?*
  ICML 2019.

See also: [splits.md](splits.md), [reproducibility.md](reproducibility.md).