eval_toolkit.text_dedup
|
Convert a string or number to a floating-point number, if possible. |
|
Outcome of a near-dedup pass. |
|
Cosine similarity on caller-supplied embeddings. |
|
SHA-256 hash-bucket dedup; similarities are exactly |
|
Set-based n-gram Jaccard similarity (diagnostic / small-corpus only). |
|
Approximate Jaccard via MinHash + LSH banding (Broder 1997, Indyk-Motwani 1998). |
|
One high-similarity pair found during a non-dropping audit. |
|
Non-dropping source/label-aware similarity audit report. |
|
Pluggable similarity backend for |
|
TF-IDF (n-gram) cosine similarity; the default lexical near-dedup. |
|
Report high-similarity pairs without dropping rows. |
|
Return eval indices to KEEP (drop those near-duplicate to any train text). |
|
Greedy near-deduplication via a pluggable similarity strategy. |
|
Canonical text normalization for hashing and deduplication. |
|
SHA-256 hex digest of (optionally normalized) text. |
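
The entries above can be illustrated with a minimal standard-library sketch of normalization, hashing, set-based n-gram Jaccard similarity, and greedy near-deduplication. All function names, parameters, and defaults below (`normalize_text`, `text_sha256`, `char_ngrams`, `jaccard`, `greedy_dedup`, the 0.8 threshold, character trigrams, NFKC plus lowercasing) are hypothetical and chosen for illustration; they are not the `eval_toolkit.text_dedup` API, whose actual signatures and normalization rules may differ.

```python
import hashlib
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Assumed canonical form: NFKC, lowercase, collapsed whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", text).strip()


def text_sha256(text: str, normalize: bool = True) -> str:
    """SHA-256 hex digest of the (optionally normalized) text."""
    if normalize:
        text = normalize_text(text)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Set of character n-grams; a text no longer than n yields itself."""
    if len(text) <= n:
        return {text}
    return {text[i : i + n] for i in range(len(text) - n + 1)}


def jaccard(a: str, b: str, n: int = 3) -> float:
    """Exact Jaccard similarity of the two normalized n-gram sets."""
    sa = char_ngrams(normalize_text(a), n)
    sb = char_ngrams(normalize_text(b), n)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def greedy_dedup(texts: list[str], threshold: float = 0.8) -> list[int]:
    """Return indices to keep: a text is dropped when its similarity
    to any already-kept text reaches the threshold."""
    kept: list[int] = []
    for i, text in enumerate(texts):
        if all(jaccard(text, texts[j]) < threshold for j in kept):
            kept.append(i)
    return kept


texts = ["Hello   World", "hello world", "something else entirely"]
print(text_sha256(texts[0]) == text_sha256(texts[1]))  # True
print(greedy_dedup(texts))  # [0, 2]
```

The greedy pass compares each text against every kept text, so it is quadratic in corpus size. That is consistent with the "diagnostic / small-corpus only" note on the set-based Jaccard entry, and it is the cost that MinHash with LSH banding is designed to avoid.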