Leakage#
Background (skip if you’ve internalized this). Leakage is when an evaluation procedure lets information about the test set influence the model: directly (a row appears in both train and test), indirectly (correlated rows, such as the same patient in both folds), or structurally (the model’s feature pipeline saw test data during fitting). It inflates reported metrics, often by more than the gap between competing papers. Kapoor & Narayanan (2023) survey 17 fields and document 294 papers affected by leakage, in many cases with overoptimistic conclusions. Treat leakage detection as a first-class eval step, not a one-off audit.
This chapter is a taxonomy of leakage classes, the eval-toolkit primitive that catches each one, and the pitfalls that make leakage hard to spot. Every code block runs under Sybil — copy them as starting points.
Setup#
import pandas as pd
from eval_toolkit import EvalSlice, run_leakage_checks
We’ll reuse two small split fixtures throughout:
clean_train = pd.DataFrame(
{"text": [f"unique_train_{i}" for i in range(20)],
"label": [i % 2 for i in range(20)]}
)
clean_test = pd.DataFrame(
{"text": [f"unique_test_{i}" for i in range(10)],
"label": [i % 2 for i in range(10)]}
)
clean_splits = {
"train": EvalSlice(name="train", df=clean_train),
"test": EvalSlice(name="test", df=clean_test),
}
Taxonomy#
The eight classes below cover what consumers of this toolkit will hit in binary-classification work. Subclasses (e.g., temporal, group, transfer-learning) are grouped under the parent that catches them.
| # | Class | Toolkit primitive | Severity default |
|---|---|---|---|
| 1 | Exact duplicate | ExactDuplicateCheck | warning |
| 2 | Near duplicate | NearDuplicateCheck | warning |
| 3 | Encoding-obfuscated duplicate | NormalizedFormLeakageCheck | error |
| 4 | Cross-split (train↔eval) | CrossSplitLeakageCheck | error |
| 5 | Label conflict | LabelConflictCheck | error |
| 6 | Group leakage | GroupLeakageCheck | error |
| 7 | Temporal leakage | TemporalLeakageCheck | error |
| 8 | Target / pretraining / distribution shift | (out of toolkit scope) | n/a |
1. Exact duplicates#
Definition. Two rows whose normalized text is identical (same string after Unicode-NFC + whitespace collapse + casefold).
Harm. Within-split duplicates skew observed metrics in proportion to the redundancy in your data: a duplicated rare positive that the model gets right is counted once per copy in recall. Across-split duplicates are a bigger issue; see class 4.
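To make the size of the effect concrete, here is a small worked example with made-up counts (10 distinct positives, 8 detected, one detected positive copied 5 extra times). The numbers are illustrative assumptions, not measurements from any corpus.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Recall = TP / (TP + FN)."""
    return true_positives / (true_positives + false_negatives)


# Hypothetical split: 10 distinct positives, the model detects 8 of them.
baseline = recall(true_positives=8, false_negatives=2)

# One *detected* positive appears 5 extra times; every copy is scored as a hit.
extra_copies = 5
with_duplicates = recall(true_positives=8 + extra_copies, false_negatives=2)

print(f"baseline={baseline:.3f} with_duplicates={with_duplicates:.3f}")
```

The same arithmetic runs the other way: if the duplicated row is one the model misses, the copies land in the false-negative count and recall drops. Either way the metric stops describing distinct examples.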
Primitive. ExactDuplicateCheck
wraps text_dedup.ExactNormalizedHashStrategy (whitespace-normalized
SHA-256 buckets). Default severity is "warning" because exact dupes
within a split are common in real corpora; opt into "error" for strict
mode.
from eval_toolkit import ExactDuplicateCheck
dup_train = pd.DataFrame(
{"text": ["hello world", "hello world", "unique"], "label": [0, 0, 1]}
)
dup_splits = {"train": EvalSlice(name="train", df=dup_train)}
finding = ExactDuplicateCheck().validate(dup_splits)
print(f"severity={finding.severity} n_affected={finding.n_affected}")
print(f"drop_indices={finding.drop_indices}")
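For intuition, the normalization named in the definition (NFC, whitespace collapse, casefold) followed by SHA-256 bucketing can be sketched in plain Python. This is a stdlib-only illustration, not the toolkit's implementation; the helper names `normalize` and `duplicate_buckets` are hypothetical.

```python
import hashlib
import re
import unicodedata
from collections import defaultdict


def normalize(text):
    """NFC-normalize, collapse whitespace runs, then casefold."""
    nfc = unicodedata.normalize("NFC", text)
    collapsed = re.sub(r"\s+", " ", nfc).strip()
    return collapsed.casefold()


def duplicate_buckets(texts):
    """Return lists of row indices whose normalized text hashes identically."""
    buckets = defaultdict(list)
    for idx, text in enumerate(texts):
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        buckets[digest].append(idx)
    return [idxs for idxs in buckets.values() if len(idxs) > 1]


rows = ["hello world", "Hello   World ", "unique"]
groups = duplicate_buckets(rows)
# Keep the first row of each bucket, drop the rest.
drop = [idx for group in groups for idx in group[1:]]
print(f"groups={groups} drop={drop}")
```

Hashing the normalized form keeps memory bounded by the number of rows rather than by text length, which is presumably why a hash-bucket design is used.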
2. Near duplicates#
Definition. Pairs of rows whose textual similarity exceeds a threshold under some sense — TF-IDF cosine, embedding cosine, MinHash Jaccard.
Harm. Same as exact duplicates but with a tunable strictness. Common sources: paraphrased social-media posts, scraped duplicates with minor HTML differences, multiple translations of the same source.
Primitive. NearDuplicateCheck
takes a strategy: SimilarityStrategy and a threshold — pluggable so
the sense of similarity is opt-in. Defaults to TfidfCosineStrategy at
0.9.
from eval_toolkit import NearDuplicateCheck
paraphrases = pd.DataFrame(
{"text": [
"the quick brown fox jumps",
"the quick brown fox jumps!",
"lorem ipsum dolor sit amet",
],
"label": [0, 0, 1]}
)
splits = {"train": EvalSlice(name="train", df=paraphrases)}
finding = NearDuplicateCheck(threshold=0.8).validate(splits)
print(f"caught {finding.n_affected} near-dupe rows")
For semantic dedup, swap the strategy:
# from eval_toolkit import EmbeddingCosineStrategy
# strategy = EmbeddingCosineStrategy(embedder=my_sentence_transformer)
# NearDuplicateCheck(strategy=strategy, threshold=0.85)
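To show what a similarity threshold means in practice, the sketch below scores pairs with Jaccard similarity over character trigrams. It is a stdlib-only stand-in for one possible sense of similarity, not the toolkit's TfidfCosineStrategy; `char_ngrams`, `jaccard`, and `near_duplicate_pairs` are hypothetical names.

```python
from itertools import combinations


def char_ngrams(text, n=3):
    """Set of casefolded character n-grams; very short strings yield one gram."""
    text = text.casefold()
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b, n=3):
    """|A ∩ B| / |A ∪ B| over the two strings' n-gram sets."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def near_duplicate_pairs(texts, threshold):
    """All index pairs (i, j), i < j, whose similarity meets the threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(texts), 2)
        if jaccard(a, b) >= threshold
    ]


texts = [
    "the quick brown fox jumps",
    "the quick brown fox jumps!",
    "lorem ipsum dolor sit amet",
]
pairs = near_duplicate_pairs(texts, threshold=0.9)
print(f"near-duplicate pairs: {pairs}")
```

The all-pairs loop is quadratic in the number of rows, so it only suits small fixtures; at corpus scale an approximate scheme such as MinHash is the usual substitute.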
3. Encoding-obfuscated duplicates (NEW in v0.7.0)#
Definition. Rows that look different but normalize to the same text under aggressive Unicode transforms — NFKC + zero-width-strip + Symbol-Other-strip (most emoji) + casefold.
Harm. This is the dominant unfixed leakage class in prompt-injection corpora. The PI_HackAPrompt_SQuAD work (2025) reports that naive dedup detects only 21.3 % of encoding-obfuscated duplicates, while the same obfuscations achieve a 76.2 % attack success rate against deployed classifiers. If your eval set has zero-width-padded copies of your train set, you are measuring nothing.
Primitive. NormalizedFormLeakageCheck
applies the aggressive normalization before hashing. Default severity is
"error" — encoding-obfuscated overlap is dangerous enough that you
should opt out (to "warning") only when upstream cleaning already
handles it.
from eval_toolkit import NormalizedFormLeakageCheck
# Same string, with zero-width spaces (U+200B) injected; written as escapes so they stay visible.
obfuscated = pd.DataFrame(
    {"text": ["hello world", "hel\u200blo wor\u200bld", "unrelated"],
     "label": [0, 1, 0]}
)
splits = {"test": EvalSlice(name="test", df=obfuscated)}
finding = NormalizedFormLeakageCheck().validate(splits)
print(f"caught {finding.n_affected} obfuscation collision(s)")
What NOT to do. Don’t rely on text.lower().strip() for prompt-injection eval. The class of attacks documented in OWASP LLM01:2025 includes zero-width injection, emoji-padded payloads, and alternate encodings; these slip past simple lowercasing and are only collapsed by NFKC plus zero-width and Symbol-Other stripping.
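The gap between naive and aggressive normalization can be sketched with the standard library alone. The code below follows the transform list in the definition (NFKC, zero-width strip, Symbol-Other strip, casefold). The exact set of zero-width code points and the final whitespace collapse are assumptions made for this illustration, and `naive` and `aggressive` are hypothetical helper names rather than toolkit functions.

```python
import re
import unicodedata

# Assumed set of zero-width code points for this sketch.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def naive(text):
    """The normalization the text warns against."""
    return text.lower().strip()


def aggressive(text):
    """NFKC, drop zero-width and Symbol-Other characters, collapse spaces, casefold."""
    text = unicodedata.normalize("NFKC", text)
    kept = [
        ch for ch in text
        if ch not in ZERO_WIDTH and unicodedata.category(ch) != "So"
    ]
    # Whitespace collapse is an extra step assumed here so that removing an
    # emoji does not leave a double space behind.
    return re.sub(r"\s+", " ", "".join(kept)).strip().casefold()


clean = "ignore previous instructions"
# Zero-width space, an emoji (category So), and a fullwidth "i" (U+FF49).
obfuscated = "ig\u200bnore \U0001F600 previous \uff49nstructions"

print(f"naive match: {naive(obfuscated) == naive(clean)}")
print(f"aggressive match: {aggressive(obfuscated) == aggressive(clean)}")
```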
4. Cross-split (train ↔ eval) leakage#
Definition. A row in your test slice that is identical or near-identical to a row in your training slice.
Harm. The canonical leakage. Your model has memorized the test point; your “test metric” is a memorization metric. Inflations of 5–30 % PR-AUC have been reported on text classification benchmarks where dedup was skipped.
Primitive. CrossSplitLeakageCheck
wraps text_dedup.cross_dedup. Default severity "error" — this is the
genuinely dangerous one.
from eval_toolkit import CrossSplitLeakageCheck
train = pd.DataFrame({"text": ["hello world a longer string", "lorem"], "label": [0, 1]})
test = pd.DataFrame({"text": ["hello world a longer string", "different"], "label": [1, 0]})
splits = {
"train": EvalSlice(name="train", df=train),
"test": EvalSlice(name="test", df=test),
}
finding = CrossSplitLeakageCheck(train_split="train").validate(splits)
print(f"{finding.n_affected} test row(s) leak from train")
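The exact-match half of this check reduces to a set-membership test: hash every training row, then look up each eval row. The sketch below assumes a simple whitespace-collapse and casefold key and does not reproduce text_dedup.cross_dedup; `row_key` and `leaked_indices` are hypothetical names.

```python
import hashlib


def row_key(text):
    """Hash of the whitespace-collapsed, casefolded text (assumed normalization)."""
    normalized = " ".join(text.split()).casefold()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def leaked_indices(train_texts, eval_texts):
    """Indices of eval rows whose key also occurs in train."""
    train_keys = {row_key(t) for t in train_texts}
    return [i for i, t in enumerate(eval_texts) if row_key(t) in train_keys]


train_texts = ["hello world a longer string", "lorem"]
test_texts = ["Hello world  a longer string", "different"]

leaks = leaked_indices(train_texts, test_texts)
print(f"{len(leaks)} test row(s) leak from train: {leaks}")
```

Only the eval side is flagged, on the assumption that the eval rows are the ones you would drop; near-identical matches need a similarity strategy on top of this.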
5. Label conflict#
Definition. The same (or near-same) text appearing with different labels across slices, or across the source datasets you concatenated to build a slice.
Harm. Annotator disagreement aside, label conflicts mean your eval is ambiguous. The same prompt being labeled “injection” in train and “benign” in test poisons the evaluation regardless of which label is “correct”.
Primitive. LabelConflictCheck.
Replaces the cross-source conflict resolution that
prompt-injection-sdd and prompt_injection_detector reimplement
(~50 LOC each).
from eval_toolkit import LabelConflictCheck
train = pd.DataFrame({"text": ["x", "y"], "label": [0, 1]})
test = pd.DataFrame({"text": ["x", "z"], "label": [1, 0]}) # conflict on "x"
splits = {
"train": EvalSlice(name="train", df=train),
"test": EvalSlice(name="test", df=test),
}
finding = LabelConflictCheck().validate(splits)
print(f"{finding.n_affected} conflicting rows across {len(splits)} splits")
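Conceptually, conflict detection groups rows by text across all splits and flags any text observed with more than one label. The following is a minimal sketch under that assumption, using plain tuples instead of EvalSlice; `label_conflicts` is a hypothetical helper, and near-same matching is left out.

```python
from collections import defaultdict


def label_conflicts(splits):
    """Map each conflicting text to its sorted (split, label) observations.

    `splits` is {split_name: [(text, label), ...]}.
    """
    observations = defaultdict(set)
    for split_name, rows in splits.items():
        for text, label in rows:
            key = " ".join(text.split()).casefold()  # assumed normalization
            observations[key].add((split_name, label))
    return {
        text: sorted(obs)
        for text, obs in observations.items()
        if len({label for _, label in obs}) > 1
    }


conflicts = label_conflicts(
    {
        "train": [("x", 0), ("y", 1)],
        "test": [("x", 1), ("z", 0)],  # "x" is labeled differently here
    }
)
print(conflicts)
```

Recording where each label came from, rather than just the conflicting text, makes it easier to decide whether to relabel, drop the row from one split, or drop it everywhere.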
6. Group leakage#
Definition. A grouping unit (patient, user, document, source) that appears in more than one split.
Harm. Within-group correlation makes the test metric a memorization of the group, not a generalization measure. The classic medical-imaging failure: same patient’s images in train and test → near-100 % accuracy that vanishes on truly held-out patients.
Primitive. GroupLeakageCheck
takes a group_col. Severity "error". Use with
GroupKFoldSplitter or
SourceDisjointKFoldSplitter — see
splits.md.
from eval_toolkit import GroupLeakageCheck
train = pd.DataFrame(
{"text": ["a", "b", "c"], "label": [0, 1, 0], "group_id": [1, 2, 3]}
)
test = pd.DataFrame(
{"text": ["d", "e"], "label": [1, 0], "group_id": [1, 4]} # group 1 spans
)
splits = {
"train": EvalSlice(name="train", df=train),
"test": EvalSlice(name="test", df=test),
}
finding = GroupLeakageCheck(group_col="group_id").validate(splits)
print(f"groups spanning splits: {sorted(finding.evidence['violating_groups'].keys())}")
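Underneath, group leakage is a membership question: which group ids occur in more than one split? A stdlib sketch, with the hypothetical helper `groups_spanning_splits` standing in for whatever the toolkit does internally:

```python
from collections import defaultdict


def groups_spanning_splits(split_groups):
    """Map each group id seen in 2+ splits to the sorted split names.

    `split_groups` is {split_name: iterable of group ids}.
    """
    membership = defaultdict(set)
    for split_name, group_ids in split_groups.items():
        for group_id in group_ids:
            membership[group_id].add(split_name)
    return {
        group_id: sorted(names)
        for group_id, names in membership.items()
        if len(names) > 1
    }


violating = groups_spanning_splits({"train": [1, 2, 3], "test": [1, 4]})
print(f"groups spanning splits: {violating}")
```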
7. Temporal leakage#
Definition. Train data with timestamps later than test data — the model is allowed to “see the future” during training.
Harm. Temporal correlation: features are non-stationary, label distributions drift, the test metric measures interpolation rather than forecasting. Recently formalized by Yan et al. (Hidden Leaks in Time Series Forecasting, 2025) which catalogs LSTM rolling-window leakage and validation-strategy leakage as distinct subclasses.
Primitive.
TemporalLeakageCheck — given a
time_col and split_order, asserts every earlier split’s max(time)
≤ next split’s min(time).
from eval_toolkit import TemporalLeakageCheck
train = pd.DataFrame({"text": ["a", "b"], "label": [0, 1], "t": [10, 20]})
test = pd.DataFrame({"text": ["c", "d"], "label": [1, 0], "t": [5, 15]}) # bad
splits = {
"train": EvalSlice(name="train", df=train),
"test": EvalSlice(name="test", df=test),
}
finding = TemporalLeakageCheck(
time_col="t", split_order=("train", "test")
).validate(splits)
print(f"violations={len(finding.evidence['violations'])}")
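The ordering assertion described above (every earlier split's max(time) ≤ the next split's min(time)) can be written directly. This sketch uses plain lists of timestamps, and `temporal_violations` is a hypothetical name.

```python
def temporal_violations(split_times, split_order):
    """Return (earlier, later, max_earlier, min_later) for each broken boundary.

    A boundary is broken when the earlier split contains a timestamp that is
    strictly greater than the smallest timestamp in the next split.
    """
    violations = []
    for earlier, later in zip(split_order, split_order[1:]):
        latest_earlier = max(split_times[earlier])
        earliest_later = min(split_times[later])
        if latest_earlier > earliest_later:
            violations.append((earlier, later, latest_earlier, earliest_later))
    return violations


bad = temporal_violations(
    {"train": [10, 20], "test": [5, 15]}, split_order=("train", "test")
)
ok = temporal_violations(
    {"train": [1, 2], "test": [2, 3]}, split_order=("train", "test")
)
print(f"bad={bad} ok={ok}")
```

Equal timestamps at the boundary pass, matching the ≤ in the description; use a strict comparison if rows sharing a timestamp could share information.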
8. Target / pretraining / distribution-shift leakage#
Definition (umbrella). Three related classes that the toolkit does not try to detect automatically:
Target leakage: a feature trivially predicts the label because it’s computed after the label is determined (e.g., “user clicked ‘unsubscribe’” as a feature for a churn label).
Transfer-learning / pretraining-corpus leakage: a pretrained model has already seen part of your eval set during pretraining (Don’t Push the Button!, Apicella et al. 2025). Common with HuggingFace pretrained transformers and public benchmarks.
Distribution shift: the production population differs from your eval population in ways that change the metric’s interpretation (covariate shift, label shift, concept drift; Recht et al. 2019).
Why the toolkit doesn’t catch these. Target leakage requires domain-specific feature semantics. Pretraining leakage requires access to the pretraining corpus. Distribution shift requires the production distribution. All three are consumer-side concerns, not eval-toolkit concerns. The pointers above are required reading.
PyTorch & transformer-specific pitfalls#
Tokenization-level duplicates#
Two strings can look distinct in plain text but tokenize identically once a transformer’s BPE / SentencePiece / WordPiece tokenizer is applied. When your downstream model is a transformer, don’t dedup on raw text alone; dedup on the tokenizer’s output too.
Since v0.37.0 the toolkit ships TokenizationLeakageCheck for this
directly. It takes a tokenizer callable (any HuggingFace
PreTrainedTokenizerBase, or any Callable[[str], Mapping] returning
HF-style {"input_ids": [...]} output) and dedups on the input_ids
tuple per row:
from transformers import AutoTokenizer
from eval_toolkit.leakage import TokenizationLeakageCheck
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
check = TokenizationLeakageCheck(tokenizer=tokenizer)
finding = check.validate(splits)
Optional install via the [transformers] extra (intentionally not
in [all] / [dev] — transformers transitively pulls torch (~700MB)
per the [embeddings] precedent). The check itself does not import
transformers; consumers pass an already-instantiated tokenizer.
Pin the tokenizer for audit reproducibility. Different transformers
releases can emit different input_ids for the same text (added-vocab
changes between minors), which would silently flip dedup outcomes.
Capture both the package version and a SHA-256 of the tokenizer.json
in your RunManifest.
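One way to capture the pin is to record the installed package version alongside a streamed SHA-256 of the tokenizer file. The sketch below is an assumption about how that could look: the dictionary keys are hypothetical and do not correspond to known RunManifest fields, and the demo hashes a stand-in file rather than a real tokenizer.json.

```python
import hashlib
import tempfile
from importlib import metadata
from pathlib import Path


def file_sha256(path):
    """Stream the file in 1 MiB chunks so large tokenizer files stay cheap."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tokenizer_pin(tokenizer_json, package="transformers"):
    """Package version (None if not installed) plus the tokenizer file hash."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = None
    return {
        "package": package,
        "version": version,
        "tokenizer_json_sha256": file_sha256(tokenizer_json),
    }


# Demo on a stand-in file; point this at your real tokenizer.json instead.
with tempfile.TemporaryDirectory() as tmp:
    stand_in = Path(tmp) / "tokenizer.json"
    stand_in.write_bytes(b'{"version": "1.0"}')
    pin = tokenizer_pin(stand_in)

print(pin)
```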
For consumers who want to avoid the [transformers] extra entirely:
emulate the check by adding a tokens column to your dataframe and
pointing ExactDuplicateCheck at it via a custom feature view on the
EvalSlice. The TokenizationLeakageCheck is just sugar over that
pattern with explicit tokenizer-pin guidance.
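The pattern described above, deduplicating on the input_ids tuple per row, can be illustrated without installing transformers. The toy tokenizer here is a stand-in that lowercases and splits on whitespace while returning HF-style output; it is not a real BPE or WordPiece tokenizer, and both function names are hypothetical.

```python
from collections import defaultdict


def make_toy_tokenizer():
    """Build a stand-in tokenizer callable returning {"input_ids": [...]}."""
    vocab = {}

    def tokenize(text):
        # Assign each new lowercase token the next free integer id.
        ids = [vocab.setdefault(tok, len(vocab)) for tok in text.lower().split()]
        return {"input_ids": ids}

    return tokenize


def token_duplicate_groups(texts, tokenizer):
    """Bucket rows by their input_ids tuple; return buckets with 2+ rows."""
    buckets = defaultdict(list)
    for idx, text in enumerate(texts):
        buckets[tuple(tokenizer(text)["input_ids"])].append(idx)
    return [idxs for idxs in buckets.values() if len(idxs) > 1]


rows = [
    "Ignore  previous instructions",
    "ignore previous INSTRUCTIONS",
    "something else",
]
token_groups = token_duplicate_groups(rows, make_toy_tokenizer())
print(f"rows sharing a token sequence: {token_groups}")
```

The first two rows differ as raw strings but collapse to one token sequence, which is the situation an uncased tokenizer creates for a raw-text dedup.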
Pretraining contamination#
Public benchmarks (GLUE, SuperGLUE, MMLU, even niche prompt-injection corpora) often appear verbatim in pretraining corpora like Common Crawl. A “9X.X % accuracy” on a public benchmark with a public-pretrained backbone is rarely a generalization claim. Two mitigations:
Audit publication dates. A model pretrained in 2024-Q3 may have seen anything published before then. Prefer eval sets curated after your model’s pretraining cutoff.
Use embedding-space dedup. EmbeddingCosineStrategy wrapped in a NearDuplicateCheck lets you flag eval rows that embed near any train row, including rough paraphrases. This doesn’t catch pretraining contamination, but it does catch your own train-set contamination of an eval set.
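The publication-date audit from the first mitigation amounts to a filter on a per-row date. The sketch below assumes each eval row carries a `published` date and uses a made-up cutoff; both the field name and the cutoff value are hypothetical.

```python
from datetime import date

# Hypothetical pretraining cutoff for the backbone under evaluation.
PRETRAINING_CUTOFF = date(2024, 9, 30)

eval_rows = [
    {"text": "older benchmark item", "published": date(2023, 5, 1)},
    {"text": "item released at the cutoff", "published": date(2024, 9, 30)},
    {"text": "freshly curated item", "published": date(2025, 2, 14)},
]

# Keep only rows published strictly after the cutoff.
post_cutoff = [row for row in eval_rows if row["published"] > PRETRAINING_CUTOFF]
print([row["text"] for row in post_cutoff])
```

A publication date is only a proxy: content can circulate before its official release, so a post-cutoff date lowers the contamination risk without ruling it out.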
Pitfalls / Common mistakes#
Running checks AFTER training. By then the harm is done. Run leakage checks at load time, before any model fits anything. The recommended pattern is to pass leakage_checks=[...] to evaluate(...), which fails the run on error-severity findings before scoring starts.
Trusting per-source dedup as cross-source dedup. Two corpora that each look clean can still leak across each other. Always run CrossSplitLeakageCheck between every (train, eval) pair, not just within each.
Picking thresholds without understanding the strategy. A TfidfCosineStrategy threshold of 0.9 is not the same fraction as a JaccardNgramStrategy threshold of 0.9. Read the strategy’s docstring before tuning.
Conflating severity="warning" with “ignore”. Warnings still appear in the manifest’s leakage_report; downstream auditors will see them. They don’t gate the run, but they’re not invisible.
Believing CV alone gives an OOD claim. K-fold CV with random partitioning measures interpolation across your sample, not generalization to a new population. For OOD claims, combine SourceDisjointKFoldSplitter + CrossSplitLeakageCheck + a held-out final test set never used in development.
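The gating behaviour described in the first and fourth pitfalls (error-severity findings stop the run before scoring, warnings are recorded but do not block) can be sketched as follows. `Finding`, `LeakageError`, and `gate` are hypothetical stand-ins for illustration, not the toolkit's classes or the internals of evaluate(...).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Hypothetical stand-in for a single check result."""
    check: str
    severity: str  # "warning" or "error"
    n_affected: int


class LeakageError(RuntimeError):
    """Raised before scoring when an error-severity check found leakage."""


def gate(findings):
    """Return findings to record in the report; raise on error-severity hits."""
    reported = [f for f in findings if f.n_affected > 0]
    errors = [f for f in reported if f.severity == "error"]
    if errors:
        names = ", ".join(f.check for f in errors)
        raise LeakageError(f"leakage checks failed before scoring: {names}")
    return reported  # warnings are kept for auditors, but do not block


recorded = gate(
    [
        Finding("ExactDuplicateCheck", "warning", 2),
        Finding("CrossSplitLeakageCheck", "error", 0),  # ran clean
    ]
)
print(f"recorded {len(recorded)} warning(s); run may proceed")
```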
Putting it all together#
from eval_toolkit import (
EvalSlice, ExactDuplicateCheck, NormalizedFormLeakageCheck,
LabelConflictCheck, CrossSplitLeakageCheck, run_leakage_checks,
)
train = pd.DataFrame({"text": ["clean a", "clean b"], "label": [0, 1]})
test = pd.DataFrame({"text": ["clean c", "clean d"], "label": [1, 0]})
splits = {
"train": EvalSlice(name="train", df=train),
"test": EvalSlice(name="test", df=test),
}
report = run_leakage_checks(
[
ExactDuplicateCheck(), # warning
NormalizedFormLeakageCheck(), # error
LabelConflictCheck(), # error
CrossSplitLeakageCheck(train_split="train"), # error
],
splits,
)
print(f"clean splits: has_errors={report.has_errors()}")
Further reading#
Kapoor, S. & Narayanan, A. Leakage and the reproducibility crisis in machine-learning-based science. Patterns 4(9), 2023. arXiv:2207.07048
Apicella, A., Isgrò, F. & Prevete, R. Don’t push the button! Exploring data leakage risks in machine learning and transfer learning. Artificial Intelligence Review, 2025. DOI
Yan, X. et al. Hidden Leaks in Time Series Forecasting. arXiv 2025. arXiv:2512.06932
OWASP. LLM01:2025 Prompt Injection. genai.owasp.org
PI_HackAPrompt_SQuAD analysis (2025). arXiv:2505.04806
Recht, B. et al. Do ImageNet classifiers generalize to ImageNet? ICML 2019.
See also: splits.md, reproducibility.md.