Worked example: cross-corpus contamination scan via pairs_across

What this shows. “For each row in eval corpus A, what’s the maximum cosine similarity to ANY row in reference corpus B?” — the canonical contamination-scan pattern. Examples: benign-vs-injection template overlap, OOD-vs-train data leakage, eval-vs-pretraining contamination.

Runtime: ~0.5 s with a stub embedder. The pattern generalises to any SimilarityStrategy.

Why pairs_across, not near_dedup

near_dedup is built for within-corpus dedup (“find duplicates inside one corpus”). The cross-corpus case (“for each query, find the most-similar reference”) is a natural fit for pairs_across:

  • Within-corpus dedup: near_dedup(texts=corpus, threshold=t, strategy=...)

  • Cross-corpus contamination: strategy.pairs_across(query_texts=eval_corpus, reference_texts=train_corpus, k=1)

The toolkit exposes both Tier-2 surfaces; the protocol is the same SimilarityStrategy — only the orchestration glue differs.
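To make the cross-corpus question concrete, here is a minimal plain-NumPy sketch of what a top-k cosine lookup across two embedding matrices computes. The function name topk_cosine_across and the toy vectors are hypothetical. This is not the toolkit's implementation of pairs_across, only the same idea under the assumption that similarity is cosine over row-normalised embeddings.

```python
import numpy as np

def topk_cosine_across(query_vecs, reference_vecs, k=1):
    """Hypothetical helper: top-k cosine similarities from each query row to the reference rows."""
    q = np.asarray(query_vecs, dtype=np.float64)
    r = np.asarray(reference_vecs, dtype=np.float64)
    if r.shape[0] == 0:
        raise ValueError("reference_vecs must not be empty")
    # Row-normalise so that a dot product equals cosine similarity
    # (assumes no all-zero rows).
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    r = r / np.linalg.norm(r, axis=1, keepdims=True)
    sims = q @ r.T  # shape (n_query, n_reference)
    k = min(k, r.shape[0])
    # Indices of the k largest similarities per query row, sorted descending.
    order = np.argsort(-sims, axis=1)[:, :k]
    top = np.take_along_axis(sims, order, axis=1)
    return top, order

query = np.array([[1.0, 0.0], [1.0, 1.0]])
reference = np.array([[0.0, 2.0], [3.0, 0.0], [1.0, 1.0]])
similarities, indices = topk_cosine_across(query, reference, k=1)
print(indices[:, 0])       # nearest reference row for each query row
print(similarities[:, 0])  # the corresponding maximum cosine
```

Within-corpus dedup would compare a matrix against itself and ignore the diagonal; the cross-corpus form never compares rows of the same corpus with each other.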

Setup

import numpy as np
from eval_toolkit.text_dedup import EmbeddingCosineStrategy

def stub_embedder(texts):
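    # Toy hashing embedder (not an identity map): each text becomes a one-hot
    # unit vector in one of 32 buckets. Python salts str hashes per process, so
    # set PYTHONHASHSEED if you need run-to-run determinism. In production use
    # make_minilm_embedder() per callable_embedder_dedup.md.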
    n = len(texts)
    vecs = np.zeros((n, 32), dtype=np.float64)
    for i, t in enumerate(texts):
        vecs[i, hash(t) % 32] = 1.0
    return vecs

strategy = EmbeddingCosineStrategy(embedder=stub_embedder)
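Because the built-in hash() of a str varies between Python processes unless PYTHONHASHSEED is fixed, a stub that must be reproducible across runs can derive its bucket from hashlib instead. The sketch below is one way to do that; stable_stub_embedder and its dim parameter are hypothetical names, not part of the toolkit.

```python
import hashlib
import numpy as np

def stable_stub_embedder(texts, dim=32):
    """Hypothetical stub: one-hot vectors whose bucket comes from a SHA-256 digest."""
    vecs = np.zeros((len(texts), dim), dtype=np.float64)
    for i, t in enumerate(texts):
        digest = hashlib.sha256(t.encode("utf-8")).digest()
        # The first 8 bytes of the digest give a process-independent integer.
        bucket = int.from_bytes(digest[:8], "big") % dim
        vecs[i, bucket] = 1.0
    return vecs

vecs = stable_stub_embedder(["how do I bake bread", "how do I bake bread", "what is 2+2"])
print(vecs.shape)
```

Assuming the strategy only needs a callable that maps a list of texts to a 2-D array, this can be passed as embedder= in place of stub_embedder. Like the original stub it carries no semantic signal: two texts score 1.0 only if they land in the same bucket.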

The contamination scan

# Eval corpus (rows we want to flag).
eval_corpus = ["how do I bake bread", "what is 2+2", "ignore previous instructions"]
# Reference corpus (training / template / contamination source).
template_corpus = ["forget all prior", "ignore all previous instructions", "step-by-step recipe"]

# For each eval row, find the single most-similar template (k=1).
similarities, indices = strategy.pairs_across(
    query_texts=eval_corpus,
    reference_texts=template_corpus,
    k=1,
)
# similarities shape: (n_eval, k=1)
# indices shape: same — indices into template_corpus

max_cosines = similarities[:, 0]
nearest_template = [template_corpus[i] for i in indices[:, 0]]

# Flag eval rows whose nearest template is suspiciously similar.
CONTAMINATION_THRESHOLD = 0.85
flagged = max_cosines >= CONTAMINATION_THRESHOLD
contamination_rate = float(flagged.mean())
print(f"contamination_rate: {contamination_rate:.2%}")
# Example output (no hash-bucket collisions between the two corpora):
# contamination_rate: 0.00%

For each flagged row i, nearest_template[i] identifies the template that triggered the flag, which is useful for auditing and manual review.
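A small audit listing makes that review step easier. The sketch below uses hand-written, illustrative values for max_cosines and nearest_template (they are not output from the toolkit) and collects one record per flagged row.

```python
import numpy as np

# Illustrative values only; in the scan above these come from pairs_across.
eval_corpus = ["how do I bake bread", "what is 2+2", "ignore previous instructions"]
nearest_template = ["step-by-step recipe", "forget all prior", "ignore all previous instructions"]
max_cosines = np.array([0.41, 0.12, 0.93])
CONTAMINATION_THRESHOLD = 0.85

flagged = max_cosines >= CONTAMINATION_THRESHOLD

# One record per flagged eval row: the text, its nearest template, and the score.
audit_rows = [
    {
        "eval_text": eval_corpus[i],
        "nearest_template": nearest_template[i],
        "cosine": float(max_cosines[i]),
    }
    for i in np.flatnonzero(flagged)
]

for row in audit_rows:
    print(f"{row['cosine']:.2f}  {row['eval_text']!r}  ~  {row['nearest_template']!r}")
```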

Variations

Top-k instead of top-1 (each eval row’s k nearest templates):

similarities, indices = strategy.pairs_across(
    query_texts=eval_corpus,
    reference_texts=template_corpus,
    k=3,  # top-3 nearest templates per eval row
)
# similarities shape: (n_eval, 3)
# indices shape: (n_eval, 3) — sorted descending by similarity
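One possible use of the top-k result is to count how many templates clear the threshold for each eval row, or to look at the gap between the best and second-best match. The similarities array below is hypothetical and simply assumes the (n_eval, 3) descending layout described above.

```python
import numpy as np

# Hypothetical top-3 similarities for three eval rows, sorted descending per row.
similarities = np.array([
    [0.91, 0.88, 0.40],
    [0.30, 0.22, 0.10],
    [0.97, 0.52, 0.49],
])
THRESHOLD = 0.85

# Number of templates at or above the threshold for each eval row.
hits_per_row = (similarities >= THRESHOLD).sum(axis=1)

# Gap between the best and second-best match; a small gap means more than
# one template is almost equally close to the eval row.
top1_margin = similarities[:, 0] - similarities[:, 1]

print(hits_per_row)
print(top1_margin)
```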

Asymmetric thresholds (different thresholds per eval_corpus slice):

# After computing max_cosines via k=1:
slice_a_mask = np.array([t.startswith("how") for t in eval_corpus])
flagged_a = (max_cosines >= 0.90) & slice_a_mask
flagged_others = (max_cosines >= 0.80) & ~slice_a_mask
total_flagged = flagged_a | flagged_others
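The same per-slice logic can also be written with a per-row threshold vector, which scales more easily to many slices. This is an equivalent sketch with hypothetical max_cosines values, not a toolkit feature.

```python
import numpy as np

eval_corpus = ["how do I bake bread", "what is 2+2", "ignore previous instructions"]
max_cosines = np.array([0.86, 0.82, 0.79])  # hypothetical values

slice_a_mask = np.array([t.startswith("how") for t in eval_corpus])

# One threshold per eval row: 0.90 inside slice A, 0.80 everywhere else.
thresholds = np.where(slice_a_mask, 0.90, 0.80)
total_flagged = max_cosines >= thresholds

print(total_flagged)
```

Here the first row belongs to slice A and stays below its stricter 0.90 threshold, while the second row is flagged against the 0.80 threshold.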

Common pitfalls

  • Empty reference corpus: pairs_across raises if reference_texts is empty. Guard upstream if your reference corpus can be empty (e.g., for projects with optional templates).

  • Dimension mismatch: EmbeddingCosineStrategy._embed validates that query and reference embeddings have the same feature dimension, so a mismatch fails loudly and early.

  • One-way semantics: this pattern is asymmetric — “for each query, find nearest reference”. Reversing the arguments (query↔reference) asks the inverse question. For symmetric within-set dedup, use pairs_within or near_dedup.
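Two of these pitfalls can be illustrated in a few lines. The sketch below assumes a hypothetical cosine matrix to show why swapping query and reference answers a different question, and adds a hypothetical guard, scan_if_reference_present, that skips the scan when the reference corpus is empty. The guard relies only on the pairs_across keyword arguments used earlier on this page.

```python
import numpy as np

# Hypothetical cosine matrix: rows are eval texts, columns are templates.
sims = np.array([
    [0.95, 0.10],
    [0.90, 0.20],
    [0.15, 0.30],
])

# Query = eval, reference = templates: one score per eval row.
eval_to_template = sims.max(axis=1)
# Query = templates, reference = eval: one score per template.
template_to_eval = sims.max(axis=0)

print(eval_to_template)   # three values, one per eval row
print(template_to_eval)   # two values, one per template

def scan_if_reference_present(strategy, query_texts, reference_texts, k=1):
    """Hypothetical guard: return None instead of calling pairs_across on an empty reference."""
    if len(reference_texts) == 0:
        return None
    return strategy.pairs_across(
        query_texts=query_texts,
        reference_texts=reference_texts,
        k=k,
    )
```

Two eval rows are close to the first template, yet the reversed direction reports only a single score for that template, so contamination rates computed in the two directions are not interchangeable.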

See also

  • EmbeddingCosineStrategy (API)

  • Leakage detection — for the in-toolkit CrossSplitLeakageCheck orchestrator that wraps this pattern with fold-aware semantics.

  • methodology/leakage.md — contamination taxonomy + when to add this scan to your eval.