eval_toolkit.text_dedup

DEFAULT_DEDUP_THRESHOLD

Default similarity threshold for deduplication (a module-level float constant).

DedupReport

Outcome of a near-dedup pass.

EmbeddingCosineStrategy

Cosine similarity on caller-supplied embeddings.

ExactNormalizedHashStrategy

SHA-256 hash-bucket dedup; similarities are exactly {0.0, 1.0}.
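
To illustrate why hash-bucket similarities can only be 0.0 or 1.0, here is a minimal standalone sketch. It is not the library's implementation: the helper names (`normalize`, `hash_similarity`, `hash_bucket_keep`) and the normalization steps (NFKC, lowercasing, whitespace collapsing) are assumptions made for this example.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    # Assumed normalization for this sketch: NFKC, lowercase, collapse whitespace.
    text = unicodedata.normalize("NFKC", text).lower()
    return " ".join(text.split())


def _digest(text: str) -> str:
    # SHA-256 hex digest of the normalized text.
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def hash_similarity(a: str, b: str) -> float:
    # Two texts either land in the same bucket (1.0) or they do not (0.0).
    return 1.0 if _digest(a) == _digest(b) else 0.0


def hash_bucket_keep(texts: list[str]) -> list[int]:
    # Keep the index of the first text seen in each hash bucket.
    seen: set[str] = set()
    keep: list[int] = []
    for i, text in enumerate(texts):
        digest = _digest(text)
        if digest not in seen:
            seen.add(digest)
            keep.append(i)
    return keep
```

Because the comparison is on digests, texts that differ by a single character after normalization count as fully distinct; catching those requires one of the graded strategies listed here.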

JaccardNgramStrategy

Set-based n-gram Jaccard similarity — diagnostic / small-corpus only.
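
As a rough illustration of set-based n-gram Jaccard, the sketch below computes |A ∩ B| / |A ∪ B| over character trigrams. The function names, the choice of character (rather than word) n-grams, and the convention for two empty inputs are assumptions of this example, not the class's documented behavior.

```python
def char_ngrams(text: str, n: int = 3) -> set[str]:
    # Texts shorter than n contribute themselves as a single "n-gram".
    if len(text) < n:
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: str, b: str, n: int = 3) -> float:
    sa, sb = char_ngrams(a, n), char_ngrams(b, n)
    if not sa and not sb:
        # Assumed convention: two empty texts are treated as identical.
        return 1.0
    return len(sa & sb) / len(sa | sb)
```

Comparing every pair this way costs a quadratic number of set intersections, which is one plausible reason to restrict it to diagnostics and small corpora.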

MinHashLSHStrategy

Approximate Jaccard via MinHash + LSH banding (Broder 1997, Indyk-Motwani 1998).
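
The idea behind MinHash with LSH banding can be sketched in a few lines of standard-library Python. This is an illustrative sketch only: the seeded BLAKE2b hash family, the default of 64 hash functions and 16 bands, and all function names are assumptions, and the class itself may be built quite differently.

```python
import hashlib


def shingles(text: str, n: int = 3) -> set[str]:
    # Character n-gram shingles; short texts yield a single shingle.
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def _hash(seed: int, token: str) -> int:
    # One member of a seeded hash family, built from BLAKE2b (64-bit output).
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens: set[str], num_perm: int = 64) -> list[int]:
    # For each hash function, keep the minimum hash over the token set.
    return [min(_hash(seed, t) for t in tokens) for seed in range(num_perm)]


def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    # The fraction of matching signature slots estimates the Jaccard similarity.
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


def lsh_band_keys(sig: list[int], bands: int = 16) -> list[tuple]:
    # Split the signature into bands; each (band index, rows) tuple is a bucket key.
    rows = len(sig) // bands
    return [(b, tuple(sig[b * rows:(b + 1) * rows])) for b in range(bands)]
```

Two texts become a candidate pair when they share at least one band key, so only candidates need the full signature comparison. More bands with fewer rows each makes it easier for lower-similarity pairs to become candidates.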

SimilarityAuditFinding

One high-similarity pair found during a non-dropping audit.

SimilarityAuditReport

Non-dropping source/label-aware similarity audit report.

SimilarityStrategy

Pluggable similarity backend for near_dedup() / cross_dedup().

TfidfCosineStrategy

TF-IDF (n-gram) cosine similarity — the default lexical near-dedup.

audit_source_label_similarity

Report high-similarity pairs without dropping rows.

cross_dedup

Return eval indices to KEEP (drop those near-duplicate to any train text).
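
The keep/drop rule described above can be sketched as follows. This is a hypothetical stand-in, not the signature of `cross_dedup`: the name `cross_keep_indices`, the `similarity` callable, the 0.9 default, and the choice to treat a score at or above the threshold as a near-duplicate are all assumptions of the example.

```python
from typing import Callable, Sequence


def token_jaccard(a: str, b: str) -> float:
    # Toy similarity for the example: Jaccard over whitespace tokens.
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def cross_keep_indices(
    eval_texts: Sequence[str],
    train_texts: Sequence[str],
    similarity: Callable[[str, str], float] = token_jaccard,
    threshold: float = 0.9,  # assumed value, not the library default
) -> list[int]:
    keep: list[int] = []
    for i, eval_text in enumerate(eval_texts):
        # Keep an eval text only if no train text reaches the threshold.
        if all(similarity(eval_text, t) < threshold for t in train_texts):
            keep.append(i)
    return keep
```

Only the eval side is filtered here; the train texts are never modified, which matches the purpose of removing train/eval leakage from an evaluation set.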

near_dedup

Greedy near-deduplication via a pluggable similarity strategy.
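
A greedy pass with a pluggable similarity might look like the sketch below. It is not the body of `near_dedup`: the function name, the plain-callable interface, and the use of `difflib.SequenceMatcher` as the example similarity are assumptions chosen to keep the example self-contained.

```python
import difflib
from typing import Callable, Sequence


def ratio_similarity(a: str, b: str) -> float:
    # Example backend: difflib's matching-block ratio in [0.0, 1.0].
    return difflib.SequenceMatcher(None, a, b).ratio()


def greedy_keep_indices(
    texts: Sequence[str],
    similarity: Callable[[str, str], float],
    threshold: float,
) -> list[int]:
    keep: list[int] = []
    for i, text in enumerate(texts):
        # Compare only against texts already kept; the first occurrence wins.
        if all(similarity(text, texts[j]) < threshold for j in keep):
            keep.append(i)
    return keep
```

For example, `greedy_keep_indices(["hello world", "hello world!", "something else entirely"], ratio_similarity, 0.9)` keeps indices 0 and 2. The result of a greedy pass depends on input order, so reordering the texts can change which member of a near-duplicate group survives.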

normalize_text_for_dedup

Canonical text normalization for hashing and deduplication.

sha256_text

SHA-256 hex digest of (optionally normalized) text.