v0.6.x → v0.7.x migration#

v0.7.0 is a BREAKING release: the string-based select_threshold API was removed in favor of the ThresholdSelector Protocol. The release also ships five extension Protocols (ThresholdSelector among them) for downstream projects.

This guide lists every change, each with a copy-pastable before/after pair.

At a glance#

| Change | Type |
| --- | --- |
| `select_threshold(criterion=str)` removed | BREAKING |
| `OperatingPoint` Literal alias removed | BREAKING (only affects callers using the alias as a type hint) |
| `select_threshold` moved from `eval_toolkit.metrics` to `eval_toolkit.thresholds` | Re-export preserves `from eval_toolkit import select_threshold`; only direct submodule imports break |
| 5 new Protocol surfaces (`ThresholdSelector`, `LeakageCheck`, `Splitter`, `DatasetLoader`, `Versioned`) | Additive |
| `evaluate(...)` gains `leakage_checks` / `on_leakage` / `on_scorer_error` parameters | Additive (defaults preserve old behavior) |
| `RunResult` gains `by_fold` / `fold_summary` / `schema_version` fields | Additive (default-empty / `"v1"`) |

1. The `select_threshold` migration#

This is the single most impactful change, but every call site can be updated mechanically using the mapping below.

| v0.6 | v0.7 |
| --- | --- |
| `criterion="max_f1"` | `criterion=MaxF1Selector()` |
| `criterion="recall_0.90"` | `criterion=TargetRecallSelector(0.90)` |
| `criterion="recall_0.95"` | `criterion=TargetRecallSelector(0.95)` |
| `criterion="precision@0.90"` (local fork in some consumers) | `criterion=TargetPrecisionSelector(0.90)` |
| `criterion="recall@0.90"` (local fork in some consumers) | `criterion=TargetRecallSelector(0.90)` |
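Because the mapping is mechanical, it can be expressed as a small lookup. The sketch below is a hypothetical helper (`legacy_criterion_to_v07` is not part of eval_toolkit) that returns the source text of the v0.7 replacement for a v0.6 string. It assumes the only legacy spellings are the ones in the table above, and its pattern is slightly broader than the table because it accepts either separator for either metric.

```python
import re

# Hypothetical helper, not part of eval_toolkit. It only produces the *text*
# of the v0.7 replacement expression listed in the migration table.
# Assumption: legacy strings look like "max_f1", "recall_0.90", "recall@0.90",
# or "precision@0.90"; either separator is accepted for either metric.
_LEGACY_PATTERN = re.compile(r"^(recall|precision)[_@](\d*\.?\d+)$")


def legacy_criterion_to_v07(criterion: str) -> str:
    """Return the v0.7 selector expression for a v0.6 criterion string."""
    if criterion == "max_f1":
        return "MaxF1Selector()"
    match = _LEGACY_PATTERN.match(criterion)
    if match is None:
        raise ValueError(f"unrecognized legacy criterion: {criterion!r}")
    metric, target = match.groups()
    selector = (
        "TargetRecallSelector" if metric == "recall" else "TargetPrecisionSelector"
    )
    return f"{selector}({target})"


for legacy in ("max_f1", "recall_0.90", "precision@0.90", "recall@0.90"):
    print(f"criterion={legacy!r} -> criterion={legacy_criterion_to_v07(legacy)}")
```

The helper returns strings rather than selector instances so that it can drive a search-and-replace script without importing the library.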

Decoding the v0.7 `TypeError`#

If you call select_threshold(y, s, criterion="max_f1") on v0.7, you’ll get:

TypeError: select_threshold requires a ThresholdSelector instance (v0.7.0+);
got str='max_f1'.
Migration:
  'max_f1'      -> MaxF1Selector()
  'recall_0.90' -> TargetRecallSelector(0.90)
  'recall_0.95' -> TargetRecallSelector(0.95)
  'precision@p' -> TargetPrecisionSelector(p)
  'recall@p'    -> TargetRecallSelector(p)
See CHANGELOG v0.7.0 for the full guide.
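To find affected call sites before the `TypeError` finds them at runtime, a text-level scan for string-valued `criterion=` arguments is usually enough. The following is a minimal sketch using only the standard library. `find_string_criteria` is a hypothetical name, and because the scan is purely textual it also reports matches inside comments and docstrings.

```python
import re

# Matches criterion="..." or criterion='...' (a string literal, the v0.6 style).
# Selector instances such as criterion=MaxF1Selector() do not match.
STRING_CRITERION = re.compile(r"""criterion\s*=\s*(['"])(.+?)\1""")


def find_string_criteria(source: str) -> list[tuple[int, str]]:
    """Return (line_number, criterion_string) for each string-valued criterion."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in STRING_CRITERION.finditer(line):
            hits.append((lineno, match.group(2)))
    return hits


sample = '''tr = select_threshold(y, s, criterion="max_f1")
tr = select_threshold(y, s, criterion=MaxF1Selector())
tr = select_threshold(y, s, criterion='recall@0.90')
'''
print(find_string_criteria(sample))  # [(1, 'max_f1'), (3, 'recall@0.90')]
```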

Worked example#

import numpy as np
from eval_toolkit import select_threshold, MaxF1Selector, TargetRecallSelector

y = np.array([0, 0, 1, 1, 0, 1])
s = np.array([0.1, 0.2, 0.7, 0.9, 0.3, 0.8])

# v0.6 (now broken):
# tr = select_threshold(y, s, criterion="max_f1")  # TypeError!

# v0.7 (works):
tr = select_threshold(y, s, criterion=MaxF1Selector())
print(f"max-F1 threshold = {tr.threshold:.3f}")

tr_r = select_threshold(y, s, criterion=TargetRecallSelector(0.95))
print(f"recall>=0.95 threshold = {tr_r.threshold:.3f}")

Naming-only difference#

If you used criterion="recall@0.90" (the at-sign separator found in some local forks), it is now TargetRecallSelector(0.90): the same behavior under a normalized API. Note that the recall@p semantics were also standardized; see the Pitfall in thresholds.md on the smallest-vs-highest-threshold-meeting-floor convention.
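The convention matters because several thresholds usually satisfy the same recall floor. The sketch below is plain Python and does not call eval_toolkit; it reuses the data from the worked example and assumes a score is predicted positive when it is greater than or equal to the threshold. It only shows that the two conventions can disagree. Which one the library applies is documented in thresholds.md, not here.

```python
def recall_at(y_true, scores, threshold):
    """Recall when every score >= threshold is predicted positive."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    return sum(s >= threshold for s in positives) / len(positives)


def thresholds_meeting_floor(y_true, scores, floor):
    """Candidate thresholds (the observed scores) whose recall is >= floor."""
    return sorted(t for t in set(scores) if recall_at(y_true, scores, t) >= floor)


y = [0, 0, 1, 1, 0, 1]
s = [0.1, 0.2, 0.7, 0.9, 0.3, 0.8]

candidates = thresholds_meeting_floor(y, s, floor=0.95)
print("thresholds meeting the floor:", candidates)  # [0.1, 0.2, 0.3, 0.7]
print("smallest:", min(candidates), "highest:", max(candidates))
```

On this data the smallest qualifying threshold (0.1) flags every row as positive, while the highest (0.7) flags only the three true positives, so code that relied on a fork's convention should re-check its selected thresholds after migrating.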

2. `OperatingPoint` Literal alias removed#

v0.6 exposed OperatingPoint = Literal["max_f1", "recall_0.90", "recall_0.95"] as a type hint. v0.7 removes it because the new ThresholdSelector instance API doesn’t take strings.

| v0.6 | v0.7 |
| --- | --- |
| `def my_fn(crit: OperatingPoint = "max_f1") -> ...` | `def my_fn(crit: ThresholdSelector = MaxF1Selector()) -> ...` |
| `from eval_toolkit.metrics import OperatingPoint` | drop the import; use `ThresholdSelector` from `eval_toolkit.thresholds` |

3. `select_threshold` module location#

| Import | Status in v0.7 |
| --- | --- |
| `from eval_toolkit.metrics import select_threshold` | works (forwarded) |
| `from eval_toolkit import select_threshold` | works (always did) |
| `from eval_toolkit.thresholds import select_threshold` | works (canonical home) |

The function now lives in eval_toolkit.thresholds. eval_toolkit.metrics no longer defines it, but the package-level re-export (from eval_toolkit import select_threshold) is preserved.

4. Adopting the new Protocols (optional but recommended)#

v0.7 adds five extension surfaces that the four prompt_injection_* consumer repos collectively migrated to:

ThresholdSelector — pluggable threshold rules.
LeakageCheck — pluggable leakage validators (incl. the new NormalizedFormLeakageCheck for encoding-obfuscated dupes).
Splitter — pluggable train/test splitting (incl. SourceDisjointKFoldSplitter).
DatasetLoader — pluggable named-splits dataset loading.
Versioned — opt-in per-object version capture for RunManifest.

See docs/extending.md for end-to-end recipes.
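For readers new to Protocol-based extension points, the sketch below shows the general mechanism with `typing.Protocol`: a class satisfies a Protocol by having the right methods, without inheriting from it. `SelectorLike` and its `select` method are hypothetical stand-ins invented for this illustration. The real `ThresholdSelector` may declare different method names and signatures, so check docs/extending.md before implementing one.

```python
from typing import Protocol, Sequence, runtime_checkable


@runtime_checkable
class SelectorLike(Protocol):
    """Hypothetical stand-in; NOT the eval_toolkit ThresholdSelector definition."""

    def select(self, y_true: Sequence[int], scores: Sequence[float]) -> float: ...


class MedianScoreSelector:
    """Toy rule: use the median score as the threshold. No inheritance needed."""

    def select(self, y_true, scores):
        ordered = sorted(scores)
        mid = len(ordered) // 2
        if len(ordered) % 2:
            return ordered[mid]
        return (ordered[mid - 1] + ordered[mid]) / 2


def apply_selector(selector: SelectorLike, y_true, scores) -> float:
    """Accept any object that structurally matches SelectorLike."""
    return selector.select(y_true, scores)


selector = MedianScoreSelector()
print(isinstance(selector, SelectorLike))  # True: the method is present
print(apply_selector(selector, [0, 1, 1], [0.2, 0.6, 0.9]))  # 0.6
```

Note that `isinstance` against a runtime-checkable Protocol only checks that the method exists, not that its signature matches; a static type checker is what verifies the signature.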

5. Adopting the new `evaluate(...)` parameters#

evaluate(...) and evaluate_folded(...) gained three optional parameters in v0.7. None of them breaks v0.6 callers; the defaults preserve the old behavior.

from eval_toolkit import (
    evaluate, EvalSlice,
    NormalizedFormLeakageCheck, LabelConflictCheck,
)
import numpy as np
import pandas as pd

class _Scorer:
    def predict_proba(self, X):
        return np.full(len(X), 0.5)

df = pd.DataFrame({"text": ["a", "b"], "label": [0, 1]})
slice_ = EvalSlice(name="test", df=df)

# v0.7 — inline leakage validation:
result = evaluate(
    {"s": _Scorer()},
    [slice_],
    run_id="r",
    leakage_checks=[NormalizedFormLeakageCheck(), LabelConflictCheck()],
    on_leakage="record",     # or "raise" (default) / "skip"
    on_scorer_error="raise",  # or "record"
)
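The difference between a "raise" and a "record" policy can be illustrated without the library. The sketch below is a generic version of the idea, not eval_toolkit's implementation: `run_scorers` is a hypothetical function, and what `evaluate(...)` actually records, and where it stores it, is not specified in this guide.

```python
def run_scorers(scorers, rows, on_scorer_error="raise"):
    """Score rows with each scorer; either propagate or record failures."""
    if on_scorer_error not in ("raise", "record"):
        raise ValueError(f"unknown on_scorer_error: {on_scorer_error!r}")
    scores, errors = {}, {}
    for name, scorer in scorers.items():
        try:
            scores[name] = scorer(rows)
        except Exception as exc:
            if on_scorer_error == "raise":
                raise  # fail fast: the first broken scorer aborts the run
            errors[name] = repr(exc)  # keep going and report the failure
    return scores, errors


def constant_scorer(rows):
    return [0.5] * len(rows)


def broken_scorer(rows):
    raise RuntimeError("model file missing")


scores, errors = run_scorers(
    {"ok": constant_scorer, "broken": broken_scorer},
    rows=["a", "b"],
    on_scorer_error="record",
)
print(scores)  # {'ok': [0.5, 0.5]}
print(errors)  # {'broken': "RuntimeError('model file missing')"}
```

Failing fast suits CI, where a broken scorer should stop the pipeline; recording suits long multi-scorer runs, where one failure should not discard the other results.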

6. New `RunResult` fields#

RunResult gains three additive fields:

by_fold: dict[str, RunResult] — populated by evaluate_folded; empty for non-folded runs.
fold_summary: dict — auto-CV-CI summary; empty for non-folded.
schema_version: str = "v1" — for downstream JSON parsers.

Existing code that reads result.by_slice[name]["by_scorer"][...] continues to work unchanged.
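A downstream JSON parser can use schema_version to stay compatible with results written by both releases. The sketch below rests on two assumptions that this guide does not confirm: that a serialized RunResult is a JSON object whose keys match the field names, and that files written before v0.7 simply lack the three new keys. `load_run_result` is a hypothetical name.

```python
import json

SUPPORTED_SCHEMAS = {"v1"}


def load_run_result(text: str) -> dict:
    """Parse a serialized run result, tolerating files without the v0.7 fields."""
    payload = json.loads(text)
    # Assumption: a missing schema_version means a pre-v0.7 file, treated as "v1".
    version = payload.get("schema_version", "v1")
    if version not in SUPPORTED_SCHEMAS:
        raise ValueError(f"unsupported schema_version: {version!r}")
    payload["schema_version"] = version
    # Fill the additive fields with the same empty defaults v0.7 uses.
    payload.setdefault("by_fold", {})
    payload.setdefault("fold_summary", {})
    return payload


old_style = '{"by_slice": {"test": {"by_scorer": {}}}}'
loaded = load_run_result(old_style)
print(loaded["schema_version"], loaded["by_fold"], loaded["fold_summary"])
```

Rejecting unknown versions explicitly means a future schema change surfaces as a clear error instead of as a silently misread file.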
