# Text deduplication > **Background** *(skip if you've internalized this)*. Deduplication is > the engine behind every leakage-detection check (see > [leakage.md](leakage.md)) and a critical preprocessing step in its > own right. The choice of *similarity strategy* — exact-hash, TF-IDF > cosine, embedding cosine, MinHash-LSH, n-gram Jaccard — determines > what counts as "similar". Picking the wrong one either misses real > dupes (false-negative leakage that inflates eval metrics) or flags > spurious matches (false-positive that drops valid eval rows). This chapter covers [`SimilarityStrategy`](../api/text_dedup.md) — the toolkit's pluggable similarity Protocol — and the five reference impls: when to use each, threshold tuning, and how dedup composes with [`LeakageCheck`](leakage.md). ## Setup ```python from eval_toolkit import ( SimilarityStrategy, TfidfCosineStrategy, ExactNormalizedHashStrategy, JaccardNgramStrategy, MinHashLSHStrategy, near_dedup, cross_dedup, ) ``` A small fixture used throughout: ```python texts = [ "the quick brown fox jumps", "the quick brown fox jumps!", # near-dupe (punctuation) "the slow blue cat naps", "lorem ipsum dolor sit amet", "lorem ipsum dolor sit amet.", # near-dupe (period) ] ``` (strategies)= ## The five strategies | Strategy | Sense of similarity | Cost | Best for | |---|---|---|---| | `ExactNormalizedHashStrategy` | Whitespace-normalized SHA-256 | O(n) | Catching exact dupes after normalization; near-zero false positives | | `TfidfCosineStrategy` | Sparse TF-IDF cosine + k-NN | O(n × k_neighbors) | **Default**; balanced precision/recall on natural text | | `EmbeddingCosineStrategy` | Caller-supplied embedding cosine | O(n²) (or use ANN) | Semantic dedup ("paraphrase" detection) | | `MinHashLSHStrategy` | MinHash + Locality-Sensitive Hashing | O(n × b × r) | Web-scale corpora where TF-IDF doesn't fit | | `JaccardNgramStrategy` | Set Jaccard on n-grams (brute-force) | O(n²) | Short texts; small n; diagnostic only | ### `ExactNormalizedHashStrategy` Catches exact duplicates after Unicode normalization (NFKC + casefold + whitespace collapse). Returns similarity 1.0 on collisions and 0.0 otherwise. ```python from eval_toolkit.text_dedup import near_dedup as _near_dedup report = near_dedup(texts, threshold=0.5, strategy=ExactNormalizedHashStrategy()) print(f"kept: {len(report.kept_indices)}; dropped: {[p[0] for p in report.dropped_pairs]}") ``` The "punctuation-different" pair above (idx 0/1) won't be caught by exact-hash because `"...fox jumps"` and `"...fox jumps!"` differ after normalization. Use TF-IDF for that. ### `TfidfCosineStrategy` (default) Sparse TF-IDF (sklearn's `TfidfVectorizer` with character n-grams), cosine similarity, top-k nearest neighbors per query. Default `threshold=0.9` is conservative; lower for paraphrase-y domains. ```python report = near_dedup(texts, threshold=0.85, strategy=TfidfCosineStrategy()) print(f"kept: {len(report.kept_indices)}; dropped: {[p[0] for p in report.dropped_pairs]}") ``` The punctuation-different pair shows up as similarity ~0.95–0.99 under TF-IDF char-n-grams; the threshold determines whether it's flagged. ### `EmbeddingCosineStrategy` Catches semantic paraphrases ("ignore all previous instructions" vs "please disregard the prior directives"). Requires the caller to supply pre-computed embeddings; the strategy itself just does cosine. ```python # Pseudo-code: requires embeddings (e.g., sentence-transformers) # from sentence_transformers import SentenceTransformer # embedder = SentenceTransformer("all-MiniLM-L6-v2") # embeddings = embedder.encode(texts, normalize_embeddings=True) # strategy = EmbeddingCosineStrategy(embeddings) # report = near_dedup(texts, threshold=0.9, strategy=strategy) ``` The toolkit's strategy doesn't load a model — keeping `transformers` out of core deps. Consumer side: pre-encode, pass the array. ### `MinHashLSHStrategy` For corpora too large to fit in TF-IDF memory (typically n > 100 k texts). Uses MinHash signatures + LSH banding for approximate near-duplicate detection. Has tunable false-negative / false-positive trade-offs via the `bands` and `rows_per_band` parameters; see the docstring at `src/eval_toolkit/text_dedup.py` for the LSH theory. ### `JaccardNgramStrategy` Brute-force Jaccard similarity on n-grams. O(n²) — only for small n or as a diagnostic. The toolkit ships it for completeness; in production reach for `MinHashLSHStrategy` if you need the Jaccard sense of similarity at scale. (thresholds)= ## Threshold tuning The threshold is strategy-specific: - **TF-IDF cosine**: `[0, 1]`. Default 0.9. Lower for paraphrase detection (e.g., 0.7); higher for strict near-exact (e.g., 0.95). - **Exact-hash**: `0.5` (any positive threshold; the strategy returns 0.0 / 1.0). - **Embedding cosine**: `[-1, 1]` typically. 0.85–0.95 for sentence- embedding strategies; depends on the embedder. - **MinHash LSH**: `[0, 1]`. 0.7–0.9 typical. Tune `bands` + `rows_per_band` to control the threshold's effective probability profile. - **Jaccard**: `[0, 1]`. 0.5–0.8 typical for short n-grams. > **What NOT to do.** Don't compare a TF-IDF threshold of 0.9 to a > MinHash threshold of 0.9 and conclude they're equivalent — > different strategies put 0.9 at different points on their > precision / recall curve. (composition)= ## How dedup composes with `LeakageCheck` [`NearDuplicateCheck`](../api/leakage.md) wraps `near_dedup` with a `SimilarityStrategy`-typed `strategy` parameter. Same for `CrossSplitLeakageCheck` wrapping `cross_dedup`. So the strategy-pluggability flows from `text_dedup` → `leakage` → `harness.evaluate(leakage_checks=...)`. This means one swap point gives you semantic-dedup leakage checks: ```python # from eval_toolkit import EmbeddingCosineStrategy, NearDuplicateCheck # # In your evaluate(...) call: # leakage_checks = [ # NearDuplicateCheck(threshold=0.85, strategy=EmbeddingCosineStrategy(emb)), # ] ``` For the prompt-injection-eval case the [`NormalizedFormLeakageCheck`](leakage.md#encoding-obfuscation) is sometimes more useful than embedding-based dedup — it catches the encoding-obfuscation attack class specifically. (cross-source)= ## Cross-source dedup (train ↔ eval) [`cross_dedup`](../api/text_dedup.md) is the lower- level primitive: given two text lists (train, eval), returns the eval indices to *keep* (i.e., those NOT near-duplicate to any train text). [`CrossSplitLeakageCheck`](leakage.md#cross-split) wraps this for the harness. (lsh)= ## LSH false-negative rates MinHash + LSH is approximate — there's a per-pair probability that genuine near-duplicates are missed. The probability depends on: - Number of bands (b) - Rows per band (r) - True Jaccard similarity (s) The threshold curve `1 - (1 - s^r)^b` is the probability of a band collision given Jaccard `s`. The "S-curve" inflection point is where your effective threshold lives; tune `(b, r)` to put it at your target similarity. For `bands=20, rows_per_band=5` (default), the inflection is around Jaccard ≈ 0.55. For tighter near-duplicate detection (e.g., target Jaccard ≈ 0.85), use `bands=10, rows_per_band=10`. (text_dedup-pitfalls)= ## Pitfalls / Common mistakes - **Using `JaccardNgramStrategy` on > 1 k texts.** O(n²) brute force; hangs for hours on real corpora. Use `MinHashLSHStrategy` for the same sense of similarity at scale. - **Forgetting that TF-IDF requires fitting on the corpus.** `TfidfCosineStrategy` learns the vocabulary from the input texts; applying a strategy fit on corpus A to corpus B will silently produce zero similarity for any out-of-vocabulary token. The toolkit's `near_dedup` always fits the strategy on the input; `cross_dedup` fits on the union of train + eval. Don't reuse a pre-fit strategy across calls. - **Embedding cosine without normalization.** If the embedding vectors aren't unit-norm, the cosine reduces to a dot product. Pass `normalize_embeddings=True` to the embedder, or pre-normalize. - **Threshold-on-the-test-set.** Tuning the dedup threshold on the test set you'll then dedup is a methodological smell. Use a validation slice or a small synthetic-near-duplicate ground truth. - **Trusting LSH at default `bands` / `rows`.** The defaults are sane but not universal; check your false-negative rate on a small ground-truth fixture before relying on it for leakage detection. - **Skipping cross-source dedup.** Within-source dedup misses the classic train→test-leak (an identical row in both splits). Always run `cross_dedup` (or `CrossSplitLeakageCheck`) between every (train, eval) pair. ## Further reading - Lee, K. et al. *Deduplicating training data makes language models better.* ACL 2022. — empirical effect of corpus dedup on LM performance; the canonical "near-dup matters" reference. - Penedo, G. et al. *RefinedWeb.* arXiv 2023. — a modern, large-scale dedup pipeline (TF-IDF + MinHash hybrid); good engineering reference. - Penedo, G. et al. *FineWeb-2.* arXiv 2025. — successor; documents per-strategy thresholds + ablations. - Broder, A. *On the resemblance and containment of documents.* SEQUENCES 1997. — the original MinHash paper. - Indyk, P. & Motwani, R. *Approximate nearest neighbors: towards removing the curse of dimensionality.* STOC 1998. — LSH foundations. See also: [leakage.md](leakage.md) (`NearDuplicateCheck`, `CrossSplitLeakageCheck`, `NormalizedFormLeakageCheck`), [extending.md §"Implementing a SimilarityStrategy"](../extending.md#similarity-strategy).