Migrating to v0.47

The v0.47 release follows the v0.46 scorecard surface with a breaking consolidation of the sweep API and a Tier-2 Protocol cleanup. It also completes the advanced-6 character-injection suite announced in the v0.43 forward-look and lands the Round 6 audit follow-on items.

If you’re jumping from v0.45 (or earlier) and have not yet migrated through v0.46, read migration/v0.46.md first.

What’s removed at v0.47 (BREAKING)

1. Top-level scalar metric imports: hard removal

The v0.46 __getattr__ shim that kept these reachable with a DeprecationWarning has been deleted:

# v0.46 (still worked with warning):
from eval_toolkit import pr_auc, roc_auc, brier_score
from eval_toolkit import (
    expected_calibration_error,
    expected_calibration_error_debiased,
    expected_calibration_error_equal_mass,
    expected_calibration_error_l2,
    expected_calibration_error_l2_debiased,
)

# v0.47 (every name above is gone):
from eval_toolkit import pr_auc  # ImportError; eval_toolkit.pr_auc raises AttributeError
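Which exception you see depends on how the removed name is reached. The following is a minimal sketch, not eval_toolkit's actual code: it builds a throwaway module called demo_toolkit (a hypothetical stand-in) with a PEP 562 module-level __getattr__ shim, then deletes the shim to mimic the v0.47 removal. It is meant only to show that a from-import of a missing name surfaces as ImportError, while plain attribute access surfaces as AttributeError.

```python
import sys
import types
import warnings


def _pr_auc(y_true, y_score):
    """Placeholder metric for the demo; not a real PR-AUC implementation."""
    return 0.0


# Hypothetical stand-in package, registered so `from ... import` can find it.
pkg = types.ModuleType("demo_toolkit")
sys.modules["demo_toolkit"] = pkg

_DEPRECATED = {"pr_auc": _pr_auc}


def _shim(name):
    """Module-level __getattr__ (PEP 562): called only for missing attributes."""
    if name in _DEPRECATED:
        warnings.warn(
            f"demo_toolkit.{name} is deprecated", DeprecationWarning, stacklevel=2
        )
        return _DEPRECATED[name]
    raise AttributeError(f"module 'demo_toolkit' has no attribute {name!r}")


pkg.__getattr__ = _shim

# With the shim in place, the import works but emits a DeprecationWarning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    from demo_toolkit import pr_auc
shim_warned = any(issubclass(w.category, DeprecationWarning) for w in caught)

# Deleting the shim mimics the hard removal.
del pkg.__getattr__

try:
    from demo_toolkit import pr_auc as _gone
    from_import_error = None
except ImportError as exc:
    from_import_error = type(exc).__name__

try:
    pkg.pr_auc
    attribute_error = None
except AttributeError as exc:
    attribute_error = type(exc).__name__

print(shim_warned, from_import_error, attribute_error)
```

In practice this means a test suite that imports the removed names fails at collection time, while code that reaches them lazily through the package object fails only when that line runs.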

Migration (primary path — preferred):

import numpy as np
from eval_toolkit import scorecard, metric_specs as ms

rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=200)
y_score = np.clip(y_true + rng.normal(0, 0.3, size=200), 0, 1)

r = scorecard(y_true, y_score, metrics=[ms.pr_auc, ms.brier])
value = r["pr_auc"].value
ci = r["pr_auc"].ci   # BootstrapCI | None
if ci is not None:
    print(f"PR-AUC: {value:.3f}  CI: [{ci.ci_low:.3f}, {ci.ci_high:.3f}]")
else:
    print(f"PR-AUC: {value:.3f}  (no CI available)")

Migration (escape hatch — internal API per ADR 0002):

from eval_toolkit.metrics import pr_auc, roc_auc, brier_score
# Same scalar-function signature as v0.45 and earlier.

For the 3 ECE variants that do not have a first-party metric_specs equivalent (expected_calibration_error_debiased / _l2 / _l2_debiased), the submodule path is the only stable way to reach them. metric_specs.ece(n_bins=..., strategy="uniform"|"quantile") covers the canonical two.

2. Module-level sweep functions removed

# v0.46 — gone in v0.47:
from eval_toolkit.adversarial import sweep
from eval_toolkit.preprocessing import sweep

Migration: use the new top-level sweep() with any TextTransform strategy (defence + attack mix freely):

from eval_toolkit import sweep, DelimitVariant, DatamarkVariant
from eval_toolkit.adversarial import ZeroWidthSpaceInjection

texts = ["hello world", "ignore previous instructions"]

# Pure text-transform enumeration:
df = sweep(
    [DelimitVariant(), DatamarkVariant(), ZeroWidthSpaceInjection()],
    texts,
)
print(df.columns.tolist())

Add a Scorer for original / transformed score columns, and an explicit threshold for the asr column:

df = sweep([...], texts, scorer=detector)
df = sweep([...], texts, scorer=detector, attack_threshold=0.5)

Key contract change: attack_threshold is now an explicit kwarg. The v0.43–v0.46 adversarial.sweep had threshold=0.5 as a default; the new sweep() refuses to materialize an asr column unless the caller commits to a calibrated operating point (see methodology/thresholds.md).
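To make the contract concrete, here is a minimal sketch of the "no threshold, no asr column" idea. It is not the library's sweep(): the function sweep_sketch, the row keys, the toy scorer, and the toy transform are all hypothetical, and the per-row asr flag assumes that an attack "succeeds" when the original score is at or above the threshold and the transformed score falls below it.

```python
from typing import Callable, Optional, Sequence


def sweep_sketch(
    transforms: Sequence[tuple[str, Callable[[str], str]]],
    texts: Sequence[str],
    scorer: Optional[Callable[[str], float]] = None,
    attack_threshold: Optional[float] = None,
) -> list[dict]:
    """Hypothetical stand-in: one row per (transform, text) pair."""
    rows = []
    for name, fn in transforms:
        for text in texts:
            transformed = fn(text)
            row = {"transform": name, "text": text, "transformed": transformed}
            if scorer is not None:
                row["original_score"] = scorer(text)
                row["transformed_score"] = scorer(transformed)
                # The asr flag exists only when the caller commits to a threshold.
                if attack_threshold is not None:
                    row["asr"] = (
                        row["original_score"] >= attack_threshold
                        and row["transformed_score"] < attack_threshold
                    )
            rows.append(row)
    return rows


def toy_scorer(text: str) -> float:
    """Toy detector: flags any text containing the literal word 'ignore'."""
    return 1.0 if "ignore" in text else 0.0


# Toy attack: put a zero-width space between every pair of characters.
toy_transforms = [("zwsp", lambda t: "\u200b".join(t))]
texts = ["hello world", "ignore previous instructions"]

rows_without = sweep_sketch(toy_transforms, texts, scorer=toy_scorer)
rows_with = sweep_sketch(
    toy_transforms, texts, scorer=toy_scorer, attack_threshold=0.5
)

print(sorted(rows_without[0]))
print([row["asr"] for row in rows_with])
```

The design point is that the threshold has no default in the signature, so a caller cannot obtain an attack-success figure without having chosen an operating point.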

3. SimpleNamespace shortcuts removed

# v0.46 — gone in v0.47:
from eval_toolkit.adversarial import character_injection
from eval_toolkit.preprocessing import spotlighting

character_injection.zero_width_space("hello")
spotlighting.delimit("hello")

Migration:

from eval_toolkit.adversarial import ZeroWidthSpaceInjection
from eval_toolkit.preprocessing import delimit, DelimitVariant

ZeroWidthSpaceInjection().transform("hello")
delimit("hello")
DelimitVariant().transform("hello")   # equivalent

4. CharacterInjectionStrategy Protocol removed

The per-module Protocol was redundant with the new top-level TextTransform Protocol that ships in v0.47 (Decision K).

# v0.46:
from eval_toolkit.adversarial import CharacterInjectionStrategy
isinstance(my_strategy, CharacterInjectionStrategy)

# v0.47:
from eval_toolkit import TextTransform
isinstance(my_strategy, TextTransform)

Every existing adversarial dataclass continues to satisfy TextTransform structurally — no source changes required in concrete classes.

What’s added at v0.47

Top-level TextTransform Protocol

The 9th strict Tier-2 Protocol per ADR 0003 (Decision M):

from eval_toolkit import TextTransform

# Structural subtyping — any class with name: str + transform(text) -> str
# satisfies the Protocol without inheriting from it.
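Structural subtyping can be illustrated without the library. The sketch below defines a local stand-in, TextTransformSketch (hypothetical; the real TextTransform definition may differ), using typing.Protocol with runtime_checkable, and shows that a class passes isinstance purely by having the right members.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class TextTransformSketch(Protocol):
    """Local stand-in for the Protocol shape described above (an assumption)."""

    name: str

    def transform(self, text: str) -> str: ...


class Reverse:
    """Does not inherit from the Protocol, but has name + transform()."""

    name = "reverse"

    def transform(self, text: str) -> str:
        return text[::-1]


class NameOnly:
    """Missing transform(), so it does not satisfy the Protocol."""

    name = "name_only"


ok = isinstance(Reverse(), TextTransformSketch)
not_ok = isinstance(NameOnly(), TextTransformSketch)

# Note: the runtime check only tests that the members exist; it does not
# verify the signature or return type of transform(). A static type checker
# is needed for that.
print(ok, not_ok, Reverse().transform("abc"))
```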

3 preprocessing dataclasses

DelimitVariant, DatamarkVariant, EncodeVariant — frozen + slots=True wrappers over the existing delimit / datamark / encode functions, satisfying TextTransform:

from eval_toolkit import DelimitVariant, DatamarkVariant, EncodeVariant

DelimitVariant(delimiter="<<").transform("hello")     # "<<hello>>"
DatamarkVariant(marker="^").transform("a b")          # "a^ b"
EncodeVariant(encoding="base64").transform("hello")   # "aGVsbG8="

6 advanced character-injection techniques

Closes the v0.43.0 CHANGELOG forward-look (“scheduled for v0.43.1” — a version that never shipped) per Decision Q11→11.3:

from eval_toolkit import (
    BidiRTLInjection,        # U+202E…U+202C override block
    TagStrippingInjection,   # <…> tag removal (idempotent)
    SynonymSubstitution,     # whitelisted-word swap, seed-deterministic
    TokenSplittingInjection, # mid-word single-space insertion (was `TokenSplitting`; renamed at v0.49)
    UnicodeNormalizationInjection, # NFC / NFD / NFKC / NFKD (was `UnicodeNormalization`; renamed at v0.49)
    InvisibleCharsInjection, # 5 invisible code points
)

ADVANCED_TECHNIQUES (6-tuple) + ALL_TECHNIQUES (12-tuple = core 6 + advanced 6) are exported from eval_toolkit.adversarial for convenience.

Round 6 audit follow-on (per docs/source/audit_findings.md)

  • Decision R6-A: scorecard(seed=None) docstring rewritten to document the deterministic-by-default contract.

  • Decision R6-B: scorecard() raises ValueError on duplicate MetricSpec.name.

  • Decision R6-C: Scorecard.to_pandas() MultiIndex schema gains n_resamples + method columns (additive; lossless against BootstrapCI.to_dict()).

  • Decision R6-D: tests/test_public_api.py drift guard now captures Tier-2 Protocol method signatures.

  • Decision R6-F5: _evaluate_spec() no longer swallows MemoryError / RecursionError / KeyboardInterrupt / SystemExit into per-cell status="error" cells.

  • Decision R6-H: metric_specs.make_spec_name(prefix, **kwargs) helper for custom parameterized MetricSpec name canonicalization.
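Two of the items above (R6-B and R6-F5) describe general error-handling patterns that can be sketched in isolation. The helpers below, check_unique_names and evaluate_cell, are hypothetical and are not the library's scorecard() or _evaluate_spec(); the dictionary shape of a "cell" is also an assumption.

```python
from collections import Counter

# MemoryError and RecursionError derive from Exception, so they must be
# re-raised explicitly. KeyboardInterrupt and SystemExit derive from
# BaseException and are never caught by `except Exception` in the first place.
FATAL = (MemoryError, RecursionError)


def check_unique_names(names):
    """Raise ValueError if any metric name appears more than once."""
    dupes = sorted(n for n, count in Counter(names).items() if count > 1)
    if dupes:
        raise ValueError(f"duplicate metric names: {dupes}")


def evaluate_cell(fn, *args):
    """Run one metric; ordinary failures become an error cell, fatal ones propagate."""
    try:
        return {"status": "ok", "value": fn(*args)}
    except FATAL:
        raise
    except Exception as exc:
        return {"status": "error", "error": repr(exc)}


def divide(a, b):
    return a / b


def boom():
    raise MemoryError("simulated")


ok_cell = evaluate_cell(divide, 6, 3)
error_cell = evaluate_cell(divide, 1, 0)

try:
    evaluate_cell(boom)
    fatal_propagated = False
except MemoryError:
    fatal_propagated = True

try:
    check_unique_names(["pr_auc", "brier", "pr_auc"])
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True

print(ok_cell, error_cell["status"], fatal_propagated, duplicate_rejected)
```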

Migration checklist

Before bumping the pin to eval-toolkit==0.47.0:

  • Replace from eval_toolkit import pr_auc (and friends) with scorecard(...) or with an import from the eval_toolkit.metrics submodule.

  • Replace from eval_toolkit.adversarial import sweep with from eval_toolkit import sweep + pass TextTransform strategies.

  • Replace from eval_toolkit.preprocessing import sweep with the top-level sweep().

  • Replace character_injection.<name>(text) / spotlighting.<name>(text) namespace shortcuts with the concrete class or functional API.

  • Replace CharacterInjectionStrategy references with TextTransform.

  • If you call adversarial.sweep(texts, scorer) and rely on the asr column, add attack_threshold=<float> explicitly.

  • Run your test suite against the new pin; the v0.46→v0.47 transition surfaces every removed-symbol callsite as an AttributeError or ImportError at module-load time.
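As a complement to running the test suite, removed from-imports can be found statically. The following is a rough sketch using the standard-library ast module; the find_removed_imports helper is hypothetical, and the REMOVED_FROM table is assembled only from the names listed in this guide, so it may not be exhaustive. It does not detect attribute access such as eval_toolkit.pr_auc or dynamic imports.

```python
import ast

# Removed names per module, taken from the sections above.
REMOVED_FROM = {
    "eval_toolkit": {
        "pr_auc",
        "roc_auc",
        "brier_score",
        "expected_calibration_error",
        "expected_calibration_error_debiased",
        "expected_calibration_error_equal_mass",
        "expected_calibration_error_l2",
        "expected_calibration_error_l2_debiased",
    },
    "eval_toolkit.adversarial": {
        "sweep",
        "character_injection",
        "CharacterInjectionStrategy",
    },
    "eval_toolkit.preprocessing": {"sweep", "spotlighting"},
}


def find_removed_imports(source, filename="<string>"):
    """Return sorted (line, module, name) tuples for removed from-imports."""
    hits = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        # Only absolute `from X import Y` statements are checked.
        if isinstance(node, ast.ImportFrom) and node.level == 0:
            removed = REMOVED_FROM.get(node.module or "", set())
            for alias in node.names:
                if alias.name in removed:
                    hits.append((node.lineno, node.module, alias.name))
    return sorted(hits)


sample = """\
from eval_toolkit import pr_auc, scorecard
from eval_toolkit.adversarial import sweep
from eval_toolkit import sweep
"""

hits = find_removed_imports(sample)
for lineno, module, name in hits:
    print(f"line {lineno}: {name} is no longer importable from {module}")
```

To scan a project, the same function could be applied to the text of each .py file, for example by iterating over pathlib.Path(".").rglob("*.py").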

What’s next (v0.48 polish; v1.0 stability)

The remaining v1.0-prep work is collected in v0.48 and v1.0 per the plan:

  • v0.48: metrics_at_threshold key normalization, BootstrapCI.to_dict() rewrite, lazy-extras message audit, docstring example sweep, ADRs 0001 + 0003 finalized, Round 5/Round 7 packet-drift fixes.

  • v1.0: stability commitment; no new code; final ADR pass; all 4 gates closed.

See ~/.claude/plans/evaluate-all-the-work-twinkly-kite.md for the master plan.