Leakage#

Background (skip if you’ve internalized this). Leakage is when an evaluation procedure lets information about the test set influence the model: directly (a row appears in both train and test), indirectly (correlated rows, such as the same patient in both folds), or structurally (the model’s feature pipeline saw test data during fitting). It inflates reported metrics, often by more than the gap between competing papers. Kapoor & Narayanan (2023) survey 17 fields in which leakage errors have been found, collectively affecting 294 papers. Treat leakage detection as a first-class eval step, not a one-off audit.

This chapter is a taxonomy of leakage classes, the eval-toolkit primitive that catches each one, and the pitfalls that make leakage hard to spot. Every code block runs under Sybil — copy them as starting points.

Setup#

import pandas as pd
from eval_toolkit import EvalSlice, run_leakage_checks

We’ll reuse two small split fixtures throughout:

clean_train = pd.DataFrame(
    {"text": [f"unique_train_{i}" for i in range(20)],
     "label": [i % 2 for i in range(20)]}
)
clean_test = pd.DataFrame(
    {"text": [f"unique_test_{i}" for i in range(10)],
     "label": [i % 2 for i in range(10)]}
)
clean_splits = {
    "train": EvalSlice(name="train", df=clean_train),
    "test":  EvalSlice(name="test",  df=clean_test),
}

Taxonomy#

The eight classes below cover what consumers of this toolkit will hit in binary-classification work. Subclasses (e.g., temporal, group, transfer-learning) are grouped under the parent that catches them.

| # | Class | Toolkit primitive | Severity default |
|---|-------|-------------------|------------------|
| 1 | Exact duplicate | ExactDuplicateCheck | warning |
| 2 | Near duplicate | NearDuplicateCheck | warning |
| 3 | Encoding-obfuscated duplicate | NormalizedFormLeakageCheck | error |
| 4 | Cross-split (train↔eval) | CrossSplitLeakageCheck | error |
| 5 | Label conflict | LabelConflictCheck | error |
| 6 | Group leakage | GroupLeakageCheck | error |
| 7 | Temporal leakage | TemporalLeakageCheck | error |
| 8 | Target / pretraining / distribution shift | (out of toolkit scope) | n/a |

1. Exact duplicates#

Definition. Two rows whose normalized text is identical (same string after Unicode-NFC + whitespace collapse + casefold).

Harm. Within-split duplicates distort observed metrics in proportion to the redundancy in your data: a duplicated rare positive is counted twice when recall is computed. Across-split duplicates are a bigger issue; see class 4.

Primitive. ExactDuplicateCheck wraps text_dedup.ExactNormalizedHashStrategy (whitespace-normalized SHA-256 buckets). Default severity is "warning" because exact dupes within a split are common in real corpora; opt into "error" for strict mode.

from eval_toolkit import ExactDuplicateCheck

dup_train = pd.DataFrame(
    {"text": ["hello world", "hello world", "unique"], "label": [0, 0, 1]}
)
dup_splits = {"train": EvalSlice(name="train", df=dup_train)}

finding = ExactDuplicateCheck().validate(dup_splits)
print(f"severity={finding.severity}  n_affected={finding.n_affected}")
print(f"drop_indices={finding.drop_indices}")
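To make the definition concrete, here is a minimal standard-library sketch of normalized-hash bucketing. It is an illustration of the idea (NFC, whitespace collapse, casefold, then SHA-256), not the toolkit's implementation; `normalize_exact` and `exact_duplicate_groups` are hypothetical names introduced here.

```python
import hashlib
import re
import unicodedata
from collections import defaultdict


def normalize_exact(text: str) -> str:
    """NFC-normalize, collapse whitespace runs, trim, and casefold."""
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text.casefold()


def exact_duplicate_groups(texts: list[str]) -> list[list[int]]:
    """Return groups of row indices whose normalized text hashes collide."""
    buckets: dict[str, list[int]] = defaultdict(list)
    for i, text in enumerate(texts):
        digest = hashlib.sha256(normalize_exact(text).encode("utf-8")).hexdigest()
        buckets[digest].append(i)
    # Only buckets holding more than one row are duplicates.
    return [rows for rows in buckets.values() if len(rows) > 1]


rows = ["hello world", "Hello   World ", "unique"]
print(exact_duplicate_groups(rows))  # rows 0 and 1 share a bucket
```

Hashing the normalized string keeps the check linear in the number of rows, which is why exact dedup is cheap enough to run on every load.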

2. Near duplicates#

Definition. Pairs of rows whose textual similarity exceeds a threshold under a chosen similarity measure, such as TF-IDF cosine, embedding cosine, or MinHash Jaccard.

Harm. Same as exact duplicates but with a tunable strictness. Common sources: paraphrased social-media posts, scraped duplicates with minor HTML differences, multiple translations of the same source.

Primitive. NearDuplicateCheck takes a strategy: SimilarityStrategy and a threshold — pluggable so the sense of similarity is opt-in. Defaults to TfidfCosineStrategy at 0.9.

from eval_toolkit import NearDuplicateCheck

paraphrases = pd.DataFrame(
    {"text": [
        "the quick brown fox jumps",
        "the quick brown fox jumps!",
        "lorem ipsum dolor sit amet",
    ],
     "label": [0, 0, 1]}
)
splits = {"train": EvalSlice(name="train", df=paraphrases)}
finding = NearDuplicateCheck(threshold=0.8).validate(splits)
print(f"caught {finding.n_affected} near-dupe rows")

For semantic dedup, swap the strategy:

# from eval_toolkit import EmbeddingCosineStrategy
# strategy = EmbeddingCosineStrategy(embedder=my_sentence_transformer)
# NearDuplicateCheck(strategy=strategy, threshold=0.85)
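For intuition about what a similarity strategy plus a threshold does, the sketch below scores pairs with character-trigram Jaccard similarity using only the standard library. This is a stand-in measure chosen for brevity, not the toolkit's TfidfCosineStrategy, and the function names are hypothetical.

```python
def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Set of casefolded character n-grams (a single short gram for tiny strings)."""
    text = text.casefold()
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def jaccard(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity: shared n-grams divided by all distinct n-grams."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def near_duplicate_pairs(texts: list[str], threshold: float) -> list[tuple[int, int]]:
    """All index pairs (i < j) scoring at or above the threshold.

    Pairwise comparison is O(n^2); fine for a demo, too slow for large corpora.
    """
    return [
        (i, j)
        for i in range(len(texts))
        for j in range(i + 1, len(texts))
        if jaccard(texts[i], texts[j]) >= threshold
    ]


texts = [
    "the quick brown fox jumps",
    "the quick brown fox jumps!",
    "lorem ipsum dolor sit amet",
]
print(near_duplicate_pairs(texts, threshold=0.8))
```

The trailing "!" adds one trigram to an otherwise identical set, so the first two rows score well above 0.8 while the third shares nothing with either.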

3. Encoding-obfuscated duplicates (NEW in v0.7.0)#

Definition. Rows that look different but normalize to the same text under aggressive Unicode transforms — NFKC + zero-width-strip + Symbol-Other-strip (most emoji) + casefold.

Harm. This is the dominant unfixed leakage class in prompt-injection corpora. The PI_HackAPrompt_SQuAD work (2025) reports that encoding-obfuscated duplicates are detected at a rate of only 21.3 % under naive dedup, yet achieve a 76.2 % attack success rate against deployed classifiers. If your eval set has zero-width-padded copies of your train set, you are measuring nothing.

Primitive. NormalizedFormLeakageCheck applies the aggressive normalization before hashing. Default severity is "error" — encoding-obfuscated overlap is dangerous enough that you should opt out (to "warning") only when upstream cleaning already handles it.

from eval_toolkit import NormalizedFormLeakageCheck

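# Same string, with zero-width spaces (U+200B) injected. They are written as
# escapes so the invisible characters survive copy-paste and stay reviewable.
obfuscated = pd.DataFrame(
    {"text": ["hello world", "h\u200be\u200bllo  world", "unrelated"],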
     "label": [0, 1, 0]}
)
splits = {"test": EvalSlice(name="test", df=obfuscated)}
finding = NormalizedFormLeakageCheck().validate(splits)
print(f"caught {finding.n_affected} obfuscation collision(s)")

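What NOT to do. Don’t rely on text.lower().strip() for prompt-injection eval. The class of attacks documented in OWASP LLM01:2025 includes zero-width injection, emoji-padded payloads, and alt-encoding; lowercasing and stripping catches none of them. Catching them takes the full normalization described above (NFKC + zero-width-strip + Symbol-Other-strip + casefold), since NFKC alone leaves zero-width characters such as U+200B in place.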
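The contrast between the naive and the aggressive normalization can be sketched with the standard library. This is an illustrative approximation, not the toolkit's code: the set of zero-width code points and the final whitespace collapse are assumptions made here, and `normalize_aggressive` is a hypothetical name.

```python
import unicodedata

# Assumption: a small, non-exhaustive set of zero-width code points.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def normalize_naive(text: str) -> str:
    """The pattern warned against above: lowercase and strip only."""
    return text.lower().strip()


def normalize_aggressive(text: str) -> str:
    """NFKC, drop zero-width and Symbol-Other characters, then casefold."""
    text = unicodedata.normalize("NFKC", text)
    kept = [
        ch for ch in text
        if ch not in ZERO_WIDTH and unicodedata.category(ch) != "So"
    ]
    # Assumption: collapse whitespace left behind by the stripped characters.
    return " ".join("".join(kept).split()).casefold()


plain = "ignore previous instructions"
# Zero-width space, no-break space, and a trailing emoji (U+1F525).
obfuscated = "Ig\u200bnore previous\u00a0instructions \U0001F525"

print(normalize_naive(obfuscated) == normalize_naive(plain))
print(normalize_aggressive(obfuscated) == normalize_aggressive(plain))
```

Only the second comparison treats the two strings as the same text, which is the collision a hash-based check needs in order to fire.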

4. Cross-split (train ↔ eval) leakage#

Definition. A row in your test slice that is identical or near-identical to a row in your training slice.

Harm. The canonical leakage. Your model has memorized the test point; your “test metric” is a memorization metric. Inflations of 5–30 % PR-AUC have been reported on text classification benchmarks where dedup was skipped.

Primitive. CrossSplitLeakageCheck wraps text_dedup.cross_dedup. Default severity "error" — this is the genuinely dangerous one.

from eval_toolkit import CrossSplitLeakageCheck

train = pd.DataFrame({"text": ["hello world a longer string", "lorem"], "label": [0, 1]})
test  = pd.DataFrame({"text": ["hello world a longer string", "different"], "label": [1, 0]})
splits = {
    "train": EvalSlice(name="train", df=train),
    "test":  EvalSlice(name="test",  df=test),
}
finding = CrossSplitLeakageCheck(train_split="train").validate(splits)
print(f"{finding.n_affected} test row(s) leak from train")
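Conceptually, the exact-match part of a cross-split check is a set-membership test: hash every train row, then flag test rows whose hash is already present. The sketch below assumes a simple whitespace-collapse and casefold normalization and uses hypothetical helper names; it does not reproduce text_dedup.cross_dedup.

```python
import hashlib


def text_key(text: str) -> str:
    """Hash of the whitespace-collapsed, casefolded text (assumed normalization)."""
    normalized = " ".join(text.split()).casefold()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def leaked_test_rows(train_texts: list[str], test_texts: list[str]) -> list[int]:
    """Indices of test rows whose key also appears in the train split."""
    train_keys = {text_key(t) for t in train_texts}
    return [i for i, t in enumerate(test_texts) if text_key(t) in train_keys]


train_texts = ["hello world a longer string", "lorem"]
test_texts = ["Hello world a longer  string", "different"]
print(leaked_test_rows(train_texts, test_texts))
```

Building the key set from train once makes the lookup for each test row constant time, so the check scales with the combined size of the two splits.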

5. Label conflict#

Definition. The same (or near-same) text appearing with different labels across slices, or across the source datasets you concatenated to build a slice.

Harm. Annotator disagreement aside, label conflicts mean your eval is ambiguous. The same prompt being labeled “injection” in train and “benign” in test poisons the evaluation regardless of which label is “correct”.

Primitive. LabelConflictCheck replaces the cross-source conflict resolution that prompt-injection-sdd and prompt_injection_detector each reimplement (~50 LOC each).

from eval_toolkit import LabelConflictCheck

train = pd.DataFrame({"text": ["x", "y"], "label": [0, 1]})
test  = pd.DataFrame({"text": ["x", "z"], "label": [1, 0]})  # conflict on "x"
splits = {
    "train": EvalSlice(name="train", df=train),
    "test":  EvalSlice(name="test",  df=test),
}
finding = LabelConflictCheck().validate(splits)
print(f"{finding.n_affected} conflicting rows across {len(splits)} splits")
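The underlying idea can be sketched as grouping rows by normalized text and reporting any text that carries more than one label. The data layout (lists of `(text, label)` tuples per split) and the function name are assumptions made for this illustration, not the toolkit's interface.

```python
from collections import defaultdict


def label_conflicts(splits: dict[str, list[tuple[str, int]]]) -> dict[str, set[int]]:
    """Map each normalized text to its labels, keeping only texts with 2+ labels."""
    labels_by_text: dict[str, set[int]] = defaultdict(set)
    for rows in splits.values():
        for text, label in rows:
            key = " ".join(text.split()).casefold()  # assumed normalization
            labels_by_text[key].add(label)
    return {text: labels for text, labels in labels_by_text.items() if len(labels) > 1}


toy_splits = {
    "train": [("x", 0), ("y", 1)],
    "test": [("x", 1), ("z", 0)],  # "x" is labeled 0 in train and 1 in test
}
print(label_conflicts(toy_splits))
```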

6. Group leakage#

Definition. A grouping unit (patient, user, document, source) that appears in more than one split.

Harm. Within-group correlation makes the test metric a memorization of the group, not a generalization measure. The classic medical-imaging failure: same patient’s images in train and test → near-100 % accuracy that vanishes on truly held-out patients.

Primitive. GroupLeakageCheck takes a group_col. Severity "error". Use with GroupKFoldSplitter or SourceDisjointKFoldSplitter — see splits.md.

from eval_toolkit import GroupLeakageCheck

train = pd.DataFrame(
    {"text": ["a", "b", "c"], "label": [0, 1, 0], "group_id": [1, 2, 3]}
)
test = pd.DataFrame(
    {"text": ["d", "e"], "label": [1, 0], "group_id": [1, 4]}  # group 1 spans
)
splits = {
    "train": EvalSlice(name="train", df=train),
    "test":  EvalSlice(name="test",  df=test),
}
finding = GroupLeakageCheck(group_col="group_id").validate(splits)
print(f"groups spanning splits: {sorted(finding.evidence['violating_groups'].keys())}")
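Stripped of the dataframe plumbing, group leakage detection is a question of which group ids show up in more than one split. A minimal sketch, with a hypothetical function name and plain lists standing in for the group column:

```python
def groups_spanning_splits(groups_by_split: dict[str, list[int]]) -> dict[int, list[str]]:
    """Map each group id seen in 2+ splits to the sorted names of those splits."""
    splits_by_group: dict[int, list[str]] = {}
    for split_name, group_ids in groups_by_split.items():
        for group_id in set(group_ids):  # count each group once per split
            splits_by_group.setdefault(group_id, []).append(split_name)
    return {
        group_id: sorted(names)
        for group_id, names in splits_by_group.items()
        if len(names) > 1
    }


toy_groups = {"train": [1, 2, 3], "test": [1, 4]}
print(groups_spanning_splits(toy_groups))  # group 1 appears in both splits
```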

7. Temporal leakage#

Definition. Train data with timestamps later than test data — the model is allowed to “see the future” during training.

Harm. Features are non-stationary and label distributions drift, so when training data postdates test data the test metric measures interpolation rather than forecasting. Yan et al. (Hidden Leaks in Time Series Forecasting, 2025) formalize this and catalog LSTM rolling-window leakage and validation-strategy leakage as distinct subclasses.

Primitive. TemporalLeakageCheck — given a time_col and split_order, asserts every earlier split’s max(time) ≤ next split’s min(time).

from eval_toolkit import TemporalLeakageCheck

train = pd.DataFrame({"text": ["a", "b"], "label": [0, 1], "t": [10, 20]})
test  = pd.DataFrame({"text": ["c", "d"], "label": [1, 0], "t": [5, 15]})  # bad
splits = {
    "train": EvalSlice(name="train", df=train),
    "test":  EvalSlice(name="test",  df=test),
}
finding = TemporalLeakageCheck(
    time_col="t", split_order=("train", "test")
).validate(splits)
print(f"violations={len(finding.evidence['violations'])}")
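The ordering rule stated above (each earlier split's max(time) ≤ the next split's min(time)) can be sketched directly. The function name and the shape of the returned tuples are choices made for this illustration only.

```python
def temporal_violations(
    times_by_split: dict[str, list[float]],
    split_order: tuple[str, ...],
) -> list[tuple[str, str, float, float]]:
    """Check consecutive splits; report (earlier, later, max_earlier, min_later)."""
    violations = []
    for earlier, later in zip(split_order, split_order[1:]):
        latest_earlier = max(times_by_split[earlier])
        earliest_later = min(times_by_split[later])
        # Equal timestamps are allowed, matching the "≤" in the rule.
        if latest_earlier > earliest_later:
            violations.append((earlier, later, latest_earlier, earliest_later))
    return violations


bad = {"train": [10, 20], "test": [5, 15]}
ok = {"train": [1, 2], "test": [2, 3]}
print(temporal_violations(bad, ("train", "test")))
print(temporal_violations(ok, ("train", "test")))
```

Comparing only max against min is enough because any single train row later than the earliest test row already breaks the ordering.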

8. Target / pretraining / distribution-shift leakage#

Definition (umbrella). Three related classes that the toolkit does not try to detect automatically:

  • Target leakage: a feature trivially predicts the label because it’s computed after the label is determined (e.g., “user clicked ‘unsubscribe’” as a feature for a churn label).

  • Transfer-learning / pretraining-corpus leakage: a pretrained model has already seen part of your eval set during pretraining (Don’t Push the Button!, Apicella et al. 2025). Common with HuggingFace pretrained transformers and public benchmarks.

  • Distribution shift: the production population differs from your eval population in ways that change the metric’s interpretation (covariate shift, label shift, concept drift; Recht et al. 2019).

Why the toolkit doesn’t catch these. Target leakage requires domain-specific feature semantics. Pretraining leakage requires access to the pretraining corpus. Distribution shift requires the production distribution. All three are consumer-side concerns, not eval-toolkit concerns. The pointers above are required reading.

PyTorch & transformer-specific pitfalls#

Tokenization-level duplicates#

Two strings can look distinct in plain text but tokenize identically once a transformer’s BPE / SentencePiece / WordPiece tokenizer is applied. When your downstream model is a transformer, don’t dedup on raw text alone; dedup on the tokenizer’s output too.

Since v0.37.0 the toolkit ships TokenizationLeakageCheck for this directly. It takes a tokenizer callable (any HuggingFace PreTrainedTokenizerBase, or any Callable[[str], Mapping] returning HF-style {"input_ids": [...]} output) and dedups on the input_ids tuple per row:

from transformers import AutoTokenizer
from eval_toolkit.leakage import TokenizationLeakageCheck

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
check = TokenizationLeakageCheck(tokenizer=tokenizer)
finding = check.validate(splits)

Install it optionally via the [transformers] extra, which is intentionally left out of [all] / [dev]: transformers transitively pulls torch (~700MB), following the [embeddings] precedent. The check itself does not import transformers; consumers pass an already-instantiated tokenizer.

Pin the tokenizer for audit reproducibility. Different transformers releases can emit different input_ids for the same text (added-vocab changes between minors), which would silently flip dedup outcomes. Capture both the package version and a SHA-256 of the tokenizer.json in your RunManifest.
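One way to capture both values is sketched below with the standard library. How they are stored in a RunManifest is not shown because that interface is not described here; `tokenizer_pin` and its dictionary keys are hypothetical, and the demo hashes a stand-in file rather than a real tokenizer.json.

```python
import hashlib
import tempfile
from importlib import metadata
from pathlib import Path
from typing import Optional


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def package_version(name: str) -> Optional[str]:
    """Installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def tokenizer_pin(tokenizer_json: Path, package: str = "transformers") -> dict:
    """Bundle the two values worth recording for audit reproducibility."""
    return {
        "package": package,
        "version": package_version(package),
        "tokenizer_sha256": file_sha256(tokenizer_json),
    }


# Demo with a stand-in file; point this at your real tokenizer.json instead.
with tempfile.TemporaryDirectory() as tmp:
    stand_in = Path(tmp) / "tokenizer.json"
    stand_in.write_bytes(b'{"version": "1.0"}')
    pin = tokenizer_pin(stand_in)
print(pin)
```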

For consumers who want to avoid the [transformers] extra entirely: emulate the check by adding a tokens column to your dataframe and pointing ExactDuplicateCheck at it via a custom feature view on the EvalSlice. The TokenizationLeakageCheck is just sugar over that pattern with explicit tokenizer-pin guidance.
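The emulation pattern can be illustrated without any tokenizer library: bucket rows by the tuple of token ids and report buckets with more than one row. The `toy_tokenizer` below is a hypothetical stand-in that only mimics the {"input_ids": [...]} output shape; a real tokenizer would produce different ids.

```python
from collections import defaultdict
from typing import Callable, Mapping


def toy_tokenizer(text: str) -> Mapping[str, list[int]]:
    """Stand-in tokenizer: lowercase, split on whitespace, look up word ids."""
    vocab = {"[UNK]": 0, "hello": 1, "world": 2}
    return {"input_ids": [vocab.get(word, 0) for word in text.lower().split()]}


def tokenization_duplicates(
    texts: list[str],
    tokenizer: Callable[[str], Mapping[str, list[int]]],
) -> list[list[int]]:
    """Groups of row indices whose input_ids tuples are identical."""
    buckets: dict[tuple[int, ...], list[int]] = defaultdict(list)
    for i, text in enumerate(texts):
        buckets[tuple(tokenizer(text)["input_ids"])].append(i)
    return [rows for rows in buckets.values() if len(rows) > 1]


texts = ["Hello world", "hello   WORLD", "hello there"]
print(len(set(texts)))  # all three raw strings are distinct
print(tokenization_duplicates(texts, toy_tokenizer))
```

The first two rows differ as raw strings but yield the same ids, which is the case raw-text dedup misses.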

Pretraining contamination#

Public benchmarks (GLUE, SuperGLUE, MMLU, even niche prompt-injection corpora) often appear verbatim in pretraining corpora like Common Crawl. A “9X.X % accuracy” on a public benchmark with a public-pretrained backbone is rarely a generalization claim. Two mitigations:

  1. Audit publication dates. A model pretrained in 2024-Q3 may have seen anything published before then. Use eval sets curated after your model’s pretraining cutoff.

  2. Use embedding-space dedup. EmbeddingCosineStrategy wrapped in a NearDuplicateCheck lets you flag eval rows that embed close to any train row, including rough paraphrases. This does not catch pretraining contamination, but it does catch your own train-set contamination of an eval set.
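The publication-date audit in the first mitigation can be sketched as a filter, assuming each eval row carries a publication date. The `published` field, the cutoff value, and the function name are all hypothetical; rows without a date are dropped because their exposure cannot be ruled out.

```python
from datetime import date
from typing import Optional


def published_after_cutoff(rows: list[dict], cutoff: date) -> list[dict]:
    """Keep only rows with a known publication date strictly after the cutoff."""
    kept = []
    for row in rows:
        published: Optional[date] = row.get("published")
        if published is not None and published > cutoff:
            kept.append(row)
    return kept


eval_rows = [
    {"text": "old benchmark item", "published": date(2023, 5, 1)},
    {"text": "fresh item", "published": date(2025, 1, 15)},
    {"text": "undated item", "published": None},
]
cutoff = date(2024, 9, 30)  # hypothetical pretraining cutoff (end of 2024-Q3)
safe = published_after_cutoff(eval_rows, cutoff)
print([row["text"] for row in safe])
```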

Pitfalls / Common mistakes#

  • Running checks AFTER training. By then the harm is done. Run leakage checks at load time, before any model fits anything. The recommended pattern is to pass leakage_checks=[...] to evaluate(...) which fails the run on error-severity findings before scoring starts.

  • Trusting per-source dedup as cross-source dedup. Two corpora that each look clean can still leak across each other. Always run CrossSplitLeakageCheck between every (train, eval) pair, not just within each.

  • Picking thresholds without understanding the strategy. A TfidfCosineStrategy threshold of 0.9 does not select the same pairs as a JaccardNgramStrategy threshold of 0.9, because the two scores live on different scales. Read the strategy’s docstring before tuning.

  • Conflating severity="warning" with “ignore”. Warnings still appear in the manifest’s leakage_report; downstream auditors will see them. They don’t gate the run, but they’re not invisible.

  • Believing CV alone gives an OOD claim. K-fold CV with random partitioning measures interpolation across your sample, not generalization to a new population. For OOD claims, combine SourceDisjointKFoldSplitter + CrossSplitLeakageCheck + a held-out final test set never used in development.
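The threshold pitfall above is easy to see numerically. The sketch below scores one pair of sentences with two simple stand-in measures, character-trigram Jaccard and bag-of-words cosine; neither is the toolkit's TfidfCosineStrategy or JaccardNgramStrategy, and the function names are hypothetical.

```python
import math
from collections import Counter


def trigram_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over sets of casefolded character trigrams."""
    def grams(text: str) -> set[str]:
        text = text.casefold()
        return {text[i:i + 3] for i in range(max(len(text) - 2, 1))}

    grams_a, grams_b = grams(a), grams(b)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def word_cosine(a: str, b: str) -> float:
    """Cosine similarity over raw word counts (no IDF weighting)."""
    counts_a = Counter(a.casefold().split())
    counts_b = Counter(b.casefold().split())
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    norm = math.sqrt(sum(v * v for v in counts_a.values())) * math.sqrt(
        sum(v * v for v in counts_b.values())
    )
    return dot / norm if norm else 0.0


first = "the quick brown fox jumps"
second = "the quick brown fox leaps"
jac = trigram_jaccard(first, second)
cos = word_cosine(first, second)
print(f"trigram Jaccard={jac:.3f}  word cosine={cos:.3f}")
```

With a threshold of 0.7 the cosine measure flags this pair and the Jaccard measure does not, so a threshold tuned under one strategy should be re-tuned after switching to another.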

Putting it all together#

from eval_toolkit import (
    EvalSlice, ExactDuplicateCheck, NormalizedFormLeakageCheck,
    LabelConflictCheck, CrossSplitLeakageCheck, run_leakage_checks,
)

train = pd.DataFrame({"text": ["clean a", "clean b"], "label": [0, 1]})
test  = pd.DataFrame({"text": ["clean c", "clean d"], "label": [1, 0]})
splits = {
    "train": EvalSlice(name="train", df=train),
    "test":  EvalSlice(name="test",  df=test),
}

report = run_leakage_checks(
    [
        ExactDuplicateCheck(),                            # warning
        NormalizedFormLeakageCheck(),                     # error
        LabelConflictCheck(),                             # error
        CrossSplitLeakageCheck(train_split="train"),      # error
    ],
    splits,
)
print(f"clean splits: has_errors={report.has_errors()}")

Further reading#

  • Kapoor, S. & Narayanan, A. Leakage and the reproducibility crisis in machine-learning-based science. Patterns 4(9), 2023. arXiv:2207.07048

  • Apicella, A., Isgrò, F. & Prevete, R. Don’t push the button! Exploring data leakage risks in machine learning and transfer learning. Artificial Intelligence Review, 2025. DOI

  • Yan, X. et al. Hidden Leaks in Time Series Forecasting. arXiv 2025. arXiv:2512.06932

  • OWASP. LLM01:2025 Prompt Injection. genai.owasp.org

  • PI_HackAPrompt_SQuAD analysis (2025). arXiv:2505.04806

  • Recht, B. et al. Do ImageNet classifiers generalize to ImageNet? ICML 2019.

See also: splits.md, reproducibility.md.