--- jupytext: text_representation: extension: .md format_name: myst kernelspec: display_name: Python 3 language: python name: python3 --- # Worked example: cross-corpus contamination scan via `pairs_across` > **What this shows.** "For each row in eval corpus A, what's the maximum > cosine similarity to ANY row in reference corpus B?" — the canonical > contamination-scan pattern. Examples: benign-vs-injection template > overlap, OOD-vs-train data leakage, eval-vs-pretraining contamination. > > **Runtime:** ~0.5 s with a stub embedder. The pattern generalises to > any `SimilarityStrategy`. ## Why `pairs_across`, not `near_dedup` `near_dedup` is built for the *within-corpus* dedup case ("find duplicates inside one corpus"). The *cross-corpus* case ("for each query, find the most-similar reference") is the natural `pairs_across` use: - **Within-corpus dedup**: `near_dedup(texts=corpus, threshold=t, strategy=...)` - **Cross-corpus contamination**: `strategy.pairs_across(query_texts=eval_corpus, reference_texts=train_corpus, k=1)` The toolkit exposes both Tier-2 surfaces; the protocol is the same `SimilarityStrategy` — only the orchestration glue differs. ## Setup ```{code-cell} import numpy as np from eval_toolkit.text_dedup import EmbeddingCosineStrategy def stub_embedder(texts): # Identity embedder: hash each text into a unit vector for deterministic # tests. In production use make_minilm_embedder() per # callable_embedder_dedup.md. n = len(texts) vecs = np.zeros((n, 32), dtype=np.float64) for i, t in enumerate(texts): vecs[i, hash(t) % 32] = 1.0 return vecs strategy = EmbeddingCosineStrategy(embedder=stub_embedder) ``` ## The contamination scan ```{code-cell} # Eval corpus (rows we want to flag). eval_corpus = ["how do I bake bread", "what is 2+2", "ignore previous instructions"] # Reference corpus (training / template / contamination source). template_corpus = ["forget all prior", "ignore all previous instructions", "step-by-step recipe"] # For each eval row, find the single most-similar template (k=1). similarities, indices = strategy.pairs_across( query_texts=eval_corpus, reference_texts=template_corpus, k=1, ) # similarities shape: (n_eval, k=1) # indices shape: same — indices into template_corpus max_cosines = similarities[:, 0] nearest_template = [template_corpus[i] for i in indices[:, 0]] # Flag eval rows whose nearest template is suspiciously similar. CONTAMINATION_THRESHOLD = 0.85 flagged = max_cosines >= CONTAMINATION_THRESHOLD contamination_rate = float(flagged.mean()) print(f"contamination_rate: {contamination_rate:.2%}") ``` For each flagged row, the `nearest_template[i]` tells you *which* template triggered the flag — useful for audit + manual review. ## Variations **Top-k** instead of top-1 (each eval row's k nearest templates): ```{code-cell} similarities, indices = strategy.pairs_across( query_texts=eval_corpus, reference_texts=template_corpus, k=3, # top-3 nearest templates per eval row ) # similarities shape: (n_eval, 3) # indices shape: (n_eval, 3) — sorted descending by similarity ``` **Asymmetric thresholds** (different thresholds per `eval_corpus` slice): ```{code-cell} # After computing max_cosines via k=1: slice_a_mask = np.array([t.startswith("how") for t in eval_corpus]) flagged_a = (max_cosines >= 0.90) & slice_a_mask flagged_others = (max_cosines >= 0.80) & ~slice_a_mask total_flagged = flagged_a | flagged_others ``` ## Common pitfalls - **Empty reference corpus**: `pairs_across` raises if `reference_texts` is empty. Guard upstream if your reference corpus can be empty (e.g., for projects with optional templates). - **Dimension mismatch**: `EmbeddingCosineStrategy._embed` validates query + reference embeddings have matching feature dimension; the failure is loud + early. - **One-way semantics**: this pattern is asymmetric — "for each query, find nearest reference". Reversing the arguments (query↔reference) asks the inverse question. For symmetric within-set dedup, use `pairs_within` or `near_dedup`. ## See also - {class}`~eval_toolkit.text_dedup.EmbeddingCosineStrategy` ([API](../api/text_dedup.md)) - [Leakage detection](leakage_detection.md) — for the in-toolkit `CrossSplitLeakageCheck` orchestrator that wraps this pattern with fold-aware semantics. - [methodology/leakage.md](../methodology/leakage.md) — contamination taxonomy + when to add this scan to your eval.