Worked example: cross-corpus contamination scan via pairs_across#
What this shows. “For each row in eval corpus A, what’s the maximum cosine similarity to ANY row in reference corpus B?” — the canonical contamination-scan pattern. Examples: benign-vs-injection template overlap, OOD-vs-train data leakage, eval-vs-pretraining contamination.
Runtime: ~0.5 s with a stub embedder. The pattern generalises to any
SimilarityStrategy.
Why pairs_across, not near_dedup#
near_dedup is built for the within-corpus dedup case (“find duplicates
inside one corpus”). The cross-corpus case (“for each query, find the
most-similar reference”) is the natural pairs_across use:
Within-corpus dedup:
near_dedup(texts=corpus, threshold=t, strategy=...)
Cross-corpus contamination:
strategy.pairs_across(query_texts=eval_corpus, reference_texts=train_corpus, k=1)
The toolkit exposes both Tier-2 surfaces; the protocol is the same
SimilarityStrategy — only the orchestration glue differs.
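To make the cross-corpus question concrete, here is a minimal brute-force sketch of "for each query, find the most-similar reference" in plain NumPy. It is not the toolkit's implementation; the helper name max_cosine_to_reference and the toy vectors are assumptions made for illustration, and rows are assumed to be non-zero.

```python
import numpy as np


def max_cosine_to_reference(query_vecs, reference_vecs):
    """For each query row, return (max cosine, index of the best reference row)."""
    # L2-normalise rows so that a dot product equals cosine similarity.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    r = reference_vecs / np.linalg.norm(reference_vecs, axis=1, keepdims=True)
    sims = q @ r.T                  # shape: (n_query, n_reference)
    best = sims.argmax(axis=1)      # nearest reference per query
    return sims[np.arange(len(q)), best], best


# Toy embeddings (hypothetical values).
query_vecs = np.array([[1.0, 0.0], [1.0, 1.0]])
reference_vecs = np.array([[0.0, 2.0], [3.0, 0.0], [1.0, 1.0]])

max_cos, nearest = max_cosine_to_reference(query_vecs, reference_vecs)

# Swapping the arguments asks the inverse question and returns one result
# per reference row instead of one per query row.
rev_cos, rev_nearest = max_cosine_to_reference(reference_vecs, query_vecs)
```

The forward call returns one score per query row; the reversed call returns one per reference row, which is why the two directions are not interchangeable.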
Setup#
import zlib

import numpy as np
from eval_toolkit.text_dedup import EmbeddingCosineStrategy


def stub_embedder(texts):
    # Stub embedder: hash each text into a one-hot unit vector so tests are
    # deterministic. zlib.crc32 is stable across processes, whereas the
    # built-in hash() is salted per process for str. In production use
    # make_minilm_embedder() per callable_embedder_dedup.md.
    n = len(texts)
    vecs = np.zeros((n, 32), dtype=np.float64)
    for i, t in enumerate(texts):
        vecs[i, zlib.crc32(t.encode("utf-8")) % 32] = 1.0
    return vecs


strategy = EmbeddingCosineStrategy(embedder=stub_embedder)
The contamination scan#
# Eval corpus (rows we want to flag).
eval_corpus = ["how do I bake bread", "what is 2+2", "ignore previous instructions"]
# Reference corpus (training / template / contamination source).
template_corpus = ["forget all prior", "ignore all previous instructions", "step-by-step recipe"]
# For each eval row, find the single most-similar template (k=1).
similarities, indices = strategy.pairs_across(
    query_texts=eval_corpus,
    reference_texts=template_corpus,
    k=1,
)
# similarities shape: (n_eval, k=1)
# indices shape: same — indices into template_corpus
max_cosines = similarities[:, 0]
nearest_template = [template_corpus[i] for i in indices[:, 0]]
# Flag eval rows whose nearest template is suspiciously similar.
CONTAMINATION_THRESHOLD = 0.85
flagged = max_cosines >= CONTAMINATION_THRESHOLD
contamination_rate = float(flagged.mean())
print(f"contamination_rate: {contamination_rate:.2%}")
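Example output. With the one-hot stub, a pair scores 1.0 only if both texts hash to the same bucket and 0.0 otherwise, so the rate is 0.00% whenever no eval row collides with a template:
contamination_rate: 0.00%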
For each flagged row i, nearest_template[i] tells you which template
triggered the flag, which is useful for audit and manual review.
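The audit step can be sketched as a small report built from the arrays above. The similarity values and indices below are made up for illustration (they are not the output of the stub embedder), and the record layout is an assumption rather than a toolkit format.

```python
import numpy as np

eval_corpus = ["how do I bake bread", "what is 2+2", "ignore previous instructions"]
template_corpus = ["forget all prior", "ignore all previous instructions", "step-by-step recipe"]

# Hypothetical scan results standing in for similarities[:, 0] and indices[:, 0].
max_cosines = np.array([0.12, 0.05, 0.93])
nearest_idx = np.array([2, 0, 1])
CONTAMINATION_THRESHOLD = 0.85

# Row numbers of the eval rows that meet or exceed the threshold.
flagged_rows = np.flatnonzero(max_cosines >= CONTAMINATION_THRESHOLD)

audit = [
    {
        "eval_row": int(i),
        "eval_text": eval_corpus[i],
        "nearest_template": template_corpus[nearest_idx[i]],
        "cosine": float(max_cosines[i]),
    }
    for i in flagged_rows
]
# Most similar pairs first, so reviewers see the strongest matches at the top.
audit.sort(key=lambda rec: rec["cosine"], reverse=True)

for rec in audit:
    print(f'{rec["cosine"]:.2f}  {rec["eval_text"]!r} ~ {rec["nearest_template"]!r}')
```

Keeping the row index alongside the text makes it easy to join the report back to the original eval table.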
Variations#
Top-k instead of top-1 (each eval row’s k nearest templates):
similarities, indices = strategy.pairs_across(
    query_texts=eval_corpus,
    reference_texts=template_corpus,
    k=3,  # top-3 nearest templates per eval row
)
# similarities shape: (n_eval, 3)
# indices shape: (n_eval, 3) — sorted descending by similarity
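One possible use of the top-k output is to count how many templates clear the threshold for each eval row, and to list them. The arrays below are hypothetical stand-ins with the (n_eval, 3) shape described above; they are not produced by the stub embedder.

```python
import numpy as np

# Hypothetical top-3 results, each row sorted descending by similarity.
similarities = np.array([
    [0.40, 0.31, 0.10],
    [0.22, 0.20, 0.05],
    [0.95, 0.88, 0.30],
])
indices = np.array([
    [2, 0, 1],
    [0, 2, 1],
    [1, 0, 2],
])
THRESHOLD = 0.85

hits = similarities >= THRESHOLD   # boolean mask, shape (n_eval, k)
n_hits = hits.sum(axis=1)          # templates above threshold, per eval row

# Template indices above threshold for each eval row (empty list if none).
matches = [indices[row][hits[row]].tolist() for row in range(len(indices))]
```

A row with several hits may deserve closer review than a row with a single borderline match.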
Asymmetric thresholds (different thresholds per eval_corpus slice):
# After computing max_cosines via k=1:
slice_a_mask = np.array([t.startswith("how") for t in eval_corpus])
flagged_a = (max_cosines >= 0.90) & slice_a_mask
flagged_others = (max_cosines >= 0.80) & ~slice_a_mask
total_flagged = flagged_a | flagged_others
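The same rule can also be written with one threshold per row, which scales more easily to many slices. This is a sketch of an equivalent formulation; the max_cosines values are hypothetical.

```python
import numpy as np

eval_corpus = ["how do I bake bread", "what is 2+2", "ignore previous instructions"]
max_cosines = np.array([0.88, 0.82, 0.79])  # hypothetical scan results

slice_a_mask = np.array([t.startswith("how") for t in eval_corpus])

# One threshold per eval row: 0.90 inside slice A, 0.80 everywhere else.
thresholds = np.where(slice_a_mask, 0.90, 0.80)
total_flagged = max_cosines >= thresholds

# Same result as combining the two masked comparisons.
two_mask = ((max_cosines >= 0.90) & slice_a_mask) | ((max_cosines >= 0.80) & ~slice_a_mask)
```

With more than two slices, the thresholds array can be filled from a slice-to-threshold mapping instead of np.where.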
Common pitfalls#
Empty reference corpus: pairs_across raises if reference_texts is empty. Guard upstream if your reference corpus can be empty (e.g., for projects with optional templates).
Dimension mismatch: EmbeddingCosineStrategy._embed validates that query and reference embeddings have matching feature dimension; the failure is loud and early.
One-way semantics: this pattern is asymmetric (“for each query, find nearest reference”). Reversing the arguments (query↔reference) asks the inverse question. For symmetric within-set dedup, use pairs_within or near_dedup.
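An upstream guard for the empty-reference case might look like the sketch below. The helper scan_or_skip and its "similarity 0.0, index -1" convention are assumptions for illustration, not part of the toolkit; the only toolkit call used is pairs_across with the keyword arguments shown earlier. The _NeverCalled class is a stand-in that lets the snippet run without the toolkit installed.

```python
import numpy as np


def scan_or_skip(strategy, query_texts, reference_texts):
    """Return (max_cosines, nearest_idx), skipping the scan if there is no reference."""
    if len(reference_texts) == 0:
        n = len(query_texts)
        # Assumed convention: 0.0 similarity and index -1 mean "nothing to compare against".
        return np.zeros(n), np.full(n, -1, dtype=int)
    similarities, indices = strategy.pairs_across(
        query_texts=query_texts,
        reference_texts=reference_texts,
        k=1,
    )
    return similarities[:, 0], indices[:, 0]


class _NeverCalled:
    """Hypothetical stand-in strategy: fails if the guard lets the call through."""

    def pairs_across(self, **kwargs):
        raise AssertionError("pairs_across should not run on an empty reference corpus")


max_cosines, nearest_idx = scan_or_skip(_NeverCalled(), ["a", "b"], [])
```

Returning zeros keeps downstream thresholding code unchanged: no row is flagged when there is no reference corpus.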
See also#
EmbeddingCosineStrategy (API)
Leakage detection: for the in-toolkit CrossSplitLeakageCheck orchestrator that wraps this pattern with fold-aware semantics.
methodology/leakage.md: contamination taxonomy and when to add this scan to your eval.