Worked example: callable-embedder pattern for EmbeddingCosineStrategy

What this shows. Semantic dedup with a caller-owned embedder. The toolkit’s EmbeddingCosineStrategy owns the cosine similarity and k-NN search; the caller owns the embedder. This keeps the toolkit free of a dependency on any specific embedding library (sentence-transformers, OpenAI, local PyTorch, etc.) while still offering a turnkey strategy class.

Runtime: the runnable example below uses a stub embedder (numpy one-hot vectors) and completes in under 1 s. The MiniLM and OpenAI patterns are illustrative (skipped under Sybil).

Pattern 1: stub embedder (testable, runnable in CI)

Any callable of type Callable[[Sequence[str]], np.ndarray] that returns a 2-D array of shape (n, d), one row per input text, works as an embedder. A stub like the one below is useful for unit tests:

import numpy as np
from eval_toolkit.text_dedup import EmbeddingCosineStrategy, near_dedup

def stub_embedder(texts):
    # One-hot per text index: every input, even a repeated text, gets its own
    # orthogonal vector, so no two inputs ever look similar.
    return np.eye(len(texts))

strategy = EmbeddingCosineStrategy(embedder=stub_embedder)
report = near_dedup(
    texts=["hello world", "goodbye", "hello world"],
    threshold=0.80,
    strategy=strategy,
)
# DedupReport carries kept_indices + dropped_pairs (n_kept / n_dropped
# are properties, not callables).
# In the stub-embedder one-hot setup, each text is its own basis vector,
# so cosine between distinct texts is 0 → no duplicates dropped at 0.80.
assert report.n_dropped == 0
assert report.n_kept == 3
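
The one-hot stub keys on position, so it can only exercise the "nothing is dropped" path. For tests that need the opposite case, a content-keyed stub is one option. The sketch below is a hypothetical helper (hashed_stub_embedder and its dim parameter are not part of the toolkit): it seeds a random unit vector from a hash of each text, so identical texts get identical rows. It checks the cosine directly with numpy rather than calling near_dedup, since how the strategy reports the pair is not verified here.

```python
import hashlib

import numpy as np


def hashed_stub_embedder(texts, dim=16):
    """Hypothetical test stub: identical texts map to identical unit vectors."""
    rows = []
    for text in texts:
        # Derive a stable 32-bit seed from the text content, not its position.
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        seed = int.from_bytes(digest[:4], "big")
        vec = np.random.default_rng(seed).standard_normal(dim)
        rows.append(vec / np.linalg.norm(vec))
    # reshape keeps the (n, d) contract even for an empty input.
    return np.array(rows, dtype=np.float64).reshape(len(rows), dim)


texts = ["hello world", "goodbye", "hello world"]
vectors = hashed_stub_embedder(texts)

# Rows are unit-length, so the Gram matrix is the cosine-similarity matrix.
cosine = vectors @ vectors.T
```

Here cosine[0, 2] is 1.0 up to floating-point error because rows 0 and 2 come from the same text, which is the situation a duplicate-detection test would want to feed to the strategy.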

Pattern 3: custom embedder (OpenAI, in-house model)

The factory pattern generalises — wrap any embedder API in a Callable that takes Sequence[str] and returns np.ndarray of shape (n, d):

import numpy as np
from openai import OpenAI
from eval_toolkit.text_dedup import EmbeddingCosineStrategy, near_dedup

client = OpenAI()

def openai_embedder(texts):
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=list(texts),
    )
    return np.array([e.embedding for e in response.data], dtype=np.float64)

strategy = EmbeddingCosineStrategy(embedder=openai_embedder)
# `corpus` stands for the caller's list of texts; it is not defined in this snippet.
report = near_dedup(texts=corpus, threshold=0.85, strategy=strategy)

The shape contract is the same: the toolkit doesn’t care which model produced the vectors, only that the callable returns (n, d) floats with consistent dimensionality across calls.
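
The shape contract can also be enforced on the caller's side, before the vectors reach the strategy. The following is a minimal sketch, assuming a hypothetical wrapper named checked_embedder (not a toolkit API): it verifies that the result is 2-D with one row per text and that the dimension d does not change between calls.

```python
import numpy as np


def checked_embedder(embedder):
    """Hypothetical helper: fail at the call site if the (n, d) contract is broken."""
    seen_dim = None

    def wrapper(texts):
        nonlocal seen_dim
        texts = list(texts)
        vectors = np.asarray(embedder(texts), dtype=np.float64)
        # One row per input text, and exactly two dimensions.
        if vectors.ndim != 2 or vectors.shape[0] != len(texts):
            raise ValueError(
                f"expected shape ({len(texts)}, d), got {vectors.shape}"
            )
        # Remember d from the first call and compare on later calls.
        if seen_dim is None:
            seen_dim = vectors.shape[1]
        elif vectors.shape[1] != seen_dim:
            raise ValueError(
                f"embedding dimension changed from {seen_dim} to {vectors.shape[1]}"
            )
        return vectors

    return wrapper


# Toy embedder used only to exercise the wrapper.
def toy_embedder(texts):
    return np.ones((len(texts), 4))


safe = checked_embedder(toy_embedder)
out = safe(["a", "b", "c"])
```

The wrapped callable has the same signature as the original, so it could be passed as embedder= in place of the raw function. Note that it only catches shape changes; a model swap that keeps the same d would still go unnoticed.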

Common pitfalls

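  • Embedder consistency: the same callable must return vectors of the same dimensionality, produced by the same model, across calls. If the underlying model changes mid-run (for example, a different model_id), vectors from the two models are not comparable and the cosine k-NN will silently return wrong neighbors.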

  • Dimension mismatch on pairs_across: EmbeddingCosineStrategy raises ValueError if query and reference embeddings have different feature dimensions — the failure is loud, but only at the pairs_across call boundary.

  • Batching for cost / throughput: the toolkit doesn’t batch; your embedder owns that. With the MiniLM factory, the sentence-transformers batch size is set through the batch_size argument, as in make_minilm_embedder(batch_size=64).
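
Since batching is left to the embedder, one way to add it to an arbitrary callable is a small chunking wrapper. The sketch below assumes a hypothetical helper named batched (not part of the toolkit): it calls the wrapped embedder on slices of at most batch_size texts and stacks the results back into a single (n, d) array. counting_embedder is a toy stand-in for a real API call.

```python
import numpy as np


def batched(embedder, batch_size=64):
    """Hypothetical helper: call `embedder` on fixed-size chunks and stack the results."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")

    def wrapper(texts):
        texts = list(texts)
        chunks = [
            np.asarray(embedder(texts[start:start + batch_size]))
            for start in range(0, len(texts), batch_size)
        ]
        if not chunks:
            # No input: the dimension is unknown, so return an empty 2-D array.
            return np.empty((0, 0))
        # Row order is preserved because chunks are stacked in input order.
        return np.vstack(chunks)

    return wrapper


calls = []


def counting_embedder(texts):
    # Toy embedder that records how many texts each call received.
    calls.append(len(texts))
    return np.ones((len(texts), 3))


vectors = batched(counting_embedder, batch_size=2)(["a", "b", "c", "d", "e"])
```

With five texts and batch_size=2, the toy embedder is called three times (2, 2, and 1 texts) and the stacked result has shape (5, 3). For a paid API, the right batch size depends on the provider's request limits, which this sketch does not model.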

See also

  • EmbeddingCosineStrategy (API)

  • make_minilm_embedder() (API)

  • methodology/text_dedup.md for strategy selection guidance (TF-IDF vs MinHash-LSH vs Embedding vs Jaccard).