Migrating to v0.47#
The v0.47 release builds on the v0.46 scorecard surface with a breaking consolidation of the sweep API and a Tier-2 Protocol cleanup. It also completes the advanced-6 character-injection suite announced in the v0.43 forward-look and lands the Round 6 audit follow-on items.
If you’re jumping from v0.45 (or earlier) and have not yet migrated through
v0.46, read migration/v0.46.md first.
What’s removed at v0.47 (BREAKING)#
1. Top-level scalar metric imports — hard removal#
The v0.46 __getattr__ shim that kept these reachable with a
DeprecationWarning has been deleted:
# v0.46 (still worked with warning):
from eval_toolkit import pr_auc, roc_auc, brier_score
from eval_toolkit import (
    expected_calibration_error,
    expected_calibration_error_debiased,
    expected_calibration_error_equal_mass,
    expected_calibration_error_l2,
    expected_calibration_error_l2_debiased,
)
# v0.47 (every name above is gone from the top level):
from eval_toolkit import pr_auc  # ImportError; attribute access (eval_toolkit.pr_auc) raises AttributeError
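The two failure modes follow from how a module-level `__getattr__` shim (PEP 562) behaves once it is deleted. The sketch below is a minimal illustration using throwaway modules; `toy_toolkit_v046`, `toy_toolkit_v047`, and `_pr_auc_impl` are hypothetical names invented for the demonstration and say nothing about how eval_toolkit implemented its shim.

```python
import sys
import types
import warnings


def _pr_auc_impl(y_true, y_score):
    """Placeholder standing in for a real metric implementation (hypothetical)."""
    return 0.0


def _shim_getattr(name):
    # Module-level __getattr__ (PEP 562): called only when normal lookup fails.
    if name == "pr_auc":
        warnings.warn("top-level pr_auc is deprecated", DeprecationWarning, stacklevel=2)
        return _pr_auc_impl
    raise AttributeError(f"module 'toy_toolkit_v046' has no attribute {name!r}")


# Toy module WITH the shim (the v0.46-style situation).
shimmed = types.ModuleType("toy_toolkit_v046")
shimmed.__getattr__ = _shim_getattr
sys.modules["toy_toolkit_v046"] = shimmed

# Toy module WITHOUT the shim (the v0.47-style situation).
bare = types.ModuleType("toy_toolkit_v047")
sys.modules["toy_toolkit_v047"] = bare


def probe(module_name):
    """Return (from-import outcome, attribute-access outcome) for `pr_auc`."""
    outcomes = []
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", DeprecationWarning)
        try:
            exec(f"from {module_name} import pr_auc", {})
            outcomes.append("ok")
        except ImportError as exc:
            outcomes.append(type(exc).__name__)
        try:
            getattr(sys.modules[module_name], "pr_auc")
            outcomes.append("ok")
        except AttributeError as exc:
            outcomes.append(type(exc).__name__)
    return tuple(outcomes)


print(probe("toy_toolkit_v046"))
print(probe("toy_toolkit_v047"))
```

With the shim in place both access styles succeed; without it, the `from ... import` form fails with `ImportError` and plain attribute access fails with `AttributeError`.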
Migration (primary path — preferred):
import numpy as np
from eval_toolkit import scorecard, metric_specs as ms
rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=200)
y_score = np.clip(y_true + rng.normal(0, 0.3, size=200), 0, 1)
r = scorecard(y_true, y_score, metrics=[ms.pr_auc, ms.brier])
value = r["pr_auc"].value
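ci = r["pr_auc"].ci  # BootstrapCI | None
if ci is None:
    # No interval available; report the point estimate only.
    print(f"PR-AUC: {value:.3f}")
else:
    print(f"PR-AUC: {value:.3f} CI: [{ci.ci_low:.3f}, {ci.ci_high:.3f}]")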
Migration (escape hatch — internal API per ADR 0002):
from eval_toolkit.metrics import pr_auc, roc_auc, brier_score
# Same scalar-function signature as v0.45 and earlier.
For the 3 ECE variants that do not have a first-party metric_specs
equivalent (expected_calibration_error_debiased / _l2 /
_l2_debiased), the submodule path is the only stable way to reach
them. metric_specs.ece(n_bins=..., strategy="uniform"|"quantile")
covers the canonical two.
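For readers unsure which binning the two `strategy` values refer to, here is a dependency-free sketch of expected calibration error with equal-width ("uniform") and equal-mass ("quantile") bins. `ece_sketch` is a hypothetical helper written for this guide; the library's exact bin edges, tie handling, and defaults are not specified here and may differ.

```python
def ece_sketch(y_true, y_prob, n_bins=10, strategy="uniform"):
    """Weighted mean |accuracy - confidence| over bins (illustrative only)."""
    if len(y_true) != len(y_prob) or not y_true:
        raise ValueError("y_true and y_prob must be non-empty and equal length")
    pairs = sorted(zip(y_prob, y_true))
    n = len(pairs)

    if strategy == "uniform":
        # Equal-width bins over [0, 1]; p == 1.0 is clamped into the last bin.
        bins = [[] for _ in range(n_bins)]
        for p, y in pairs:
            bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    elif strategy == "quantile":
        # Equal-mass bins: split the sorted scores into n_bins contiguous chunks.
        bins = [pairs[i * n // n_bins:(i + 1) * n // n_bins] for i in range(n_bins)]
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")

    total = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        confidence = sum(p for p, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        total += (len(members) / n) * abs(accuracy - confidence)
    return total


print(ece_sketch([0, 0, 1, 1], [0.0, 0.0, 1.0, 1.0]))
print(ece_sketch([1, 1, 1, 1], [0.5] * 4, n_bins=2, strategy="quantile"))
```

The debiased and L2 variants use different estimators and are not covered by this sketch.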
2. Module-level sweep functions removed#
# v0.46 — gone in v0.47:
from eval_toolkit.adversarial import sweep
from eval_toolkit.preprocessing import sweep
Migration: use the new top-level sweep() with any TextTransform strategy (defence and attack transforms mix freely):
from eval_toolkit import sweep, DelimitVariant, DatamarkVariant
from eval_toolkit.adversarial import ZeroWidthSpaceInjection
texts = ["hello world", "ignore previous instructions"]
# Pure text-transform enumeration:
df = sweep(
    [DelimitVariant(), DatamarkVariant(), ZeroWidthSpaceInjection()],
    texts,
)
print(df.columns.tolist())
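Add a Scorer to get the original / transformed score columns, and an explicit
threshold to get the asr column:
# strategies: a list of TextTransform instances, as in the previous example.
# detector: your Scorer instance.
df = sweep(strategies, texts, scorer=detector)  # score columns, no asr
df = sweep(strategies, texts, scorer=detector, attack_threshold=0.5)  # adds asr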
Key contract change: attack_threshold is now an explicit kwarg.
The v0.43–v0.46 adversarial.sweep had threshold=0.5 as a
default; the new sweep() refuses to materialize an asr column
unless the caller commits to a calibrated operating point (see
methodology/thresholds.md).
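The "no threshold, no asr" contract can be pictured with a small stand-alone function. This is a sketch under stated assumptions, not the library's code: `summarize_scores` is a hypothetical helper, and attack success rate is assumed here to mean the fraction of inputs flagged before the transform (score at or above the threshold) whose transformed score falls below it. The library's actual definition may differ.

```python
import math


def summarize_scores(original, transformed, attack_threshold=None):
    """Summarize paired scores; emit 'asr' only when a threshold is given."""
    if len(original) != len(transformed) or not original:
        raise ValueError("score sequences must be non-empty and equal length")

    summary = {
        "mean_original": sum(original) / len(original),
        "mean_transformed": sum(transformed) / len(transformed),
    }
    if attack_threshold is None:
        # No operating point committed: do not fabricate an asr value.
        return summary

    # ASSUMED definition: among inputs flagged before the transform,
    # count those whose transformed score drops below the threshold.
    flagged = [(o, t) for o, t in zip(original, transformed) if o >= attack_threshold]
    evaded = sum(1 for _, t in flagged if t < attack_threshold)
    summary["asr"] = evaded / len(flagged) if flagged else math.nan
    return summary


original = [0.9, 0.8, 0.2]
transformed = [0.4, 0.9, 0.1]
print(summarize_scores(original, transformed))
print(summarize_scores(original, transformed, attack_threshold=0.5))
```

Making the threshold a required opt-in means a reported rate is always tied to an operating point the caller chose, rather than to a silent 0.5 default.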
3. SimpleNamespace shortcuts removed#
# v0.46 — gone in v0.47:
from eval_toolkit.adversarial import character_injection
from eval_toolkit.preprocessing import spotlighting
character_injection.zero_width_space("hello")
spotlighting.delimit("hello")
Migration:
from eval_toolkit.adversarial import ZeroWidthSpaceInjection
from eval_toolkit.preprocessing import delimit # or DelimitVariant
ZeroWidthSpaceInjection().transform("hello")
delimit("hello")
DelimitVariant().transform("hello") # equivalent
4. CharacterInjectionStrategy Protocol removed#
The per-module Protocol was redundant with the new top-level
TextTransform Protocol that ships in v0.47 (Decision K).
# v0.46:
from eval_toolkit.adversarial import CharacterInjectionStrategy
isinstance(my_strategy, CharacterInjectionStrategy)
# v0.47:
from eval_toolkit import TextTransform
isinstance(my_strategy, TextTransform)
Every existing adversarial dataclass continues to satisfy
TextTransform structurally — no source changes required in concrete
classes.
What’s added at v0.47#
Top-level TextTransform Protocol#
The 9th strict Tier-2 Protocol per ADR 0003 (Decision M):
from eval_toolkit import TextTransform
# Structural subtyping — any class with name: str + transform(text) -> str
# satisfies the Protocol without inheriting from it.
3 preprocessing dataclasses#
DelimitVariant, DatamarkVariant, EncodeVariant — frozen +
slots=True wrappers over the existing delimit / datamark /
encode functions, satisfying TextTransform:
from eval_toolkit import DelimitVariant, DatamarkVariant, EncodeVariant
DelimitVariant(delimiter="<<").transform("hello") # "<<hello>>"
DatamarkVariant(marker="^").transform("a b") # "a^ b"
EncodeVariant(encoding="base64").transform("hello") # "aGVsbG8="
6 advanced character-injection techniques#
Closes the v0.43.0 CHANGELOG forward-look (“scheduled for v0.43.1” — a version that never shipped) per Decision Q11→11.3:
from eval_toolkit import (
    BidiRTLInjection, # U+202E…U+202C override block
    TagStrippingInjection, # <…> tag removal (idempotent)
    SynonymSubstitution, # whitelisted-word swap, seed-deterministic
    TokenSplittingInjection, # mid-word single-space insertion (was `TokenSplitting`; renamed at v0.49)
    UnicodeNormalizationInjection, # NFC / NFD / NFKC / NFKD (was `UnicodeNormalization`; renamed at v0.49)
    InvisibleCharsInjection, # 5 invisible code points
)
ADVANCED_TECHNIQUES (6-tuple) + ALL_TECHNIQUES (12-tuple =
core 6 + advanced 6) are exported from eval_toolkit.adversarial for
convenience.
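Three of the techniques listed above are simple enough to illustrate with the standard library. The functions below are hypothetical sketches of the underlying text operations, based only on the one-line descriptions in the import comment; they are not the library classes, and the real implementations may handle edge cases differently.

```python
import re
import unicodedata

RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def bidi_override_sketch(text: str) -> str:
    """Wrap text in a U+202E ... U+202C override block."""
    return f"{RLO}{text}{PDF}"


def strip_tags_sketch(text: str) -> str:
    """Remove anything that looks like <...> (a crude regex, not an HTML parser)."""
    return re.sub(r"<[^>]*>", "", text)


def normalize_sketch(text: str, form: str = "NFKC") -> str:
    """Apply one of the four Unicode normalization forms."""
    if form not in {"NFC", "NFD", "NFKC", "NFKD"}:
        raise ValueError(f"unknown normalization form: {form!r}")
    return unicodedata.normalize(form, text)


print(repr(bidi_override_sketch("abc")))
print(strip_tags_sketch("<b>hi</b>"))
print(normalize_sketch("\ufb01le"))  # U+FB01 is the 'fi' ligature
```

Applying `strip_tags_sketch` to its own output leaves the text unchanged, which is the idempotence property mentioned for tag stripping.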
Round 6 audit follow-on (per docs/source/audit_findings.md)#
- Decision R6-A: scorecard(seed=None) docstring rewritten to document the deterministic-by-default contract.
- Decision R6-B: scorecard() raises ValueError on duplicate MetricSpec.name.
- Decision R6-C: Scorecard.to_pandas() MultiIndex schema gains n_resamples + method columns (additive; lossless against BootstrapCI.to_dict()).
- Decision R6-D: tests/test_public_api.py drift guard now captures Tier-2 Protocol method signatures.
- Decision R6-F5: _evaluate_spec() no longer swallows MemoryError / RecursionError / KeyboardInterrupt / SystemExit into per-cell status="error" cells.
- Decision R6-H: metric_specs.make_spec_name(prefix, **kwargs) helper for custom parameterized MetricSpec name canonicalization.
Migration checklist#
Before bumping the pin to eval-toolkit==0.47.0:
- Replace from eval_toolkit import pr_auc (and friends) with scorecard(...) OR from eval_toolkit.metrics import ….
- Replace from eval_toolkit.adversarial import sweep with from eval_toolkit import sweep + pass TextTransform strategies.
- Replace from eval_toolkit.preprocessing import sweep with the top-level sweep().
- Replace character_injection.<name>(text) / spotlighting.<name>(text) namespace shortcuts with the concrete class or functional API.
- Replace CharacterInjectionStrategy references with TextTransform.
- If you call adversarial.sweep(texts, scorer) and rely on the asr column, add attack_threshold=<float> explicitly.
- Run your test suite against the new pin; the v0.46→v0.47 transition surfaces every removed-symbol callsite as an AttributeError or ImportError at module-load time.
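Before running the test suite, a quick static scan can locate the `from ... import` lines that the checklist asks you to replace. The helper below is a hypothetical sketch built on the standard `ast` module; the removed-name table is transcribed from the sections above, and the scan does not catch attribute-style access such as `eval_toolkit.pr_auc`.

```python
import ast

# Removed symbols per module, transcribed from this guide.
REMOVED = {
    "eval_toolkit": {
        "pr_auc", "roc_auc", "brier_score",
        "expected_calibration_error",
        "expected_calibration_error_debiased",
        "expected_calibration_error_equal_mass",
        "expected_calibration_error_l2",
        "expected_calibration_error_l2_debiased",
    },
    "eval_toolkit.adversarial": {
        "sweep", "character_injection", "CharacterInjectionStrategy",
    },
    "eval_toolkit.preprocessing": {"sweep", "spotlighting"},
}


def find_removed_imports(source: str):
    """Return sorted (line, module, name) tuples for removed from-imports."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom) and node.module in REMOVED:
            for alias in node.names:
                if alias.name in REMOVED[node.module]:
                    hits.append((node.lineno, node.module, alias.name))
    return sorted(hits)


example = (
    "from eval_toolkit import pr_auc, scorecard\n"
    "from eval_toolkit.adversarial import sweep\n"
    "from eval_toolkit.metrics import roc_auc\n"
)
for line, module, name in find_removed_imports(example):
    print(f"line {line}: {module}.{name} was removed in v0.47")
```

In practice you would read each project file and pass its text to `find_removed_imports`; the test suite remains the authoritative check.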
What’s next (v0.48 polish; v1.0 stability)#
The remaining v1.0-prep work is collected in v0.48 and v1.0 per the plan:
- v0.48: metrics_at_threshold key normalization, BootstrapCI.to_dict() rewrite, lazy-extras message audit, docstring example sweep, ADRs 0001 + 0003 finalized, Round 5/Round 7 packet-drift fixes.
- v1.0: stability commitment; no new code; final ADR pass; all 4 gates closed.
See ~/.claude/plans/evaluate-all-the-work-twinkly-kite.md for the
master plan.