# Migrating to v0.46.0

eval-toolkit v0.46.0 introduces a new primary metric surface and **soft-deprecates** the top-level scalar metric imports. This guide covers the consumer-side changes.

> v0.46 is **soft-breaking**: existing code using top-level scalar imports continues to work but emits `DeprecationWarning`. At v0.47, the deprecated imports become hard `AttributeError`. Use the v0.46 cycle to migrate.

## TL;DR

| Before (v0.45 and earlier) | After (v0.46+) |
|---|---|
| `from eval_toolkit import pr_auc` | `from eval_toolkit import scorecard, metric_specs as ms` |
| `auc = pr_auc(y_true, y_score)` | `r = scorecard(y_true, y_score, metrics=[ms.pr_auc], bootstrap=False); auc = r["pr_auc"].value` |
| `from eval_toolkit import bootstrap_ci; ci = bootstrap_ci(y, s, pr_auc, n_resamples=1000)` | `r = scorecard(y, s, metrics=[ms.pr_auc], bootstrap=True, n_resamples=1000); auc, ci = r["pr_auc"].value, r["pr_auc"].ci` |

## What changed

### 1. New `scorecard()` primary metric surface

A single function that computes multiple threshold-free metrics + bootstrap CIs on one slice. Returns a `Scorecard` (`Mapping[str, MetricResult]`):

```python
from eval_toolkit import scorecard, metric_specs as ms

r = scorecard(
    y_true, y_score,
    metrics=[ms.pr_auc, ms.roc_auc, ms.brier, ms.ece(n_bins=15)],
    bootstrap=True, n_resamples=1000, confidence=0.95, rng=0,
)

# Dict-subscript access (type-safe under mypy --strict):
r["pr_auc"].value    # 0.873
r["pr_auc"].status   # 'ok' | 'skipped' | 'error'
r["pr_auc"].ci       # BootstrapCI(low=0.84, high=0.90, ...)
r["pr_auc"].reason   # '' when ok; explanation when skipped/error

r.to_dict()          # JSON-friendly dict
r.to_pandas()        # one-row DataFrame (requires [dataframe] extra)
```

### 2. Soft-deprecated top-level scalar imports

These 8 names are no longer in `eval_toolkit.__all__` and emit `DeprecationWarning` on lookup. They will be hard-removed at v0.47.

- `pr_auc`
- `roc_auc`
- `brier_score`
- `expected_calibration_error`
- `expected_calibration_error_debiased`
- `expected_calibration_error_equal_mass`
- `expected_calibration_error_l2`
- `expected_calibration_error_l2_debiased`

### 3. Submodule path remains as the **internal-API escape hatch**

`from eval_toolkit.metrics import pr_auc` works at v0.46, v0.47, and v1.0+, **without** deprecation warning. This is intended for:

- Power-user / Monte-Carlo inner loops where `scorecard()` orchestration cost is too high.
- Custom `MetricSpec` implementations that wrap the scalar function.

⚠️ The submodule path is documented as **internal API** per [ADR 0002](../adr/0002-scorecard-as-primary-metric-surface.md). It is **not part of the v1.0 strict stability contract** and may be refactored in major versions. Use `scorecard()` for code that needs the v1.0 stability promise.

### 4. Threshold-dependent metrics are NOT in `metric_specs`

`f1`, `accuracy`, `precision`, `recall` are intentionally absent from the v0.46 first-party spec namespace per [Decision R](../adr/0002-scorecard-as-primary-metric-surface.md) of the v1.0 plan. They need a threshold, and threshold provenance is its own concern.

To compute them, use the existing operating-point machinery:

```python
from eval_toolkit import MaxF1Selector, metrics_at_threshold

# Step 1: select a threshold from val
selector_result = MaxF1Selector().select(y_val, score_val)

# Step 2: compute metrics at that threshold on test
m = metrics_at_threshold(y_test, score_test, threshold=selector_result.threshold)
m["f1"], m["precision"], m["recall"]
```

If your eval pipeline calls F1 / accuracy / precision / recall on a separate threshold-free path (e.g., for paired bootstrap), see `paired_bootstrap_op_point_diff` for the threshold-aware paired-comparison helper.

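To see why threshold provenance is treated as a separate concern, here is a minimal, toolkit-free sketch of the two-step flow described above: pick a threshold on validation data, then score test data at that fixed threshold. The helpers `f1_at_threshold` and `select_max_f1_threshold` are hypothetical illustrations written for this guide. They are not the implementation of `MaxF1Selector` or `metrics_at_threshold`, and they assume binary labels in `{0, 1}` with "score >= threshold" meaning a positive prediction.

```python
from typing import Sequence


def f1_at_threshold(
    y_true: Sequence[int], y_score: Sequence[float], threshold: float
) -> float:
    """F1 when every score >= threshold is predicted positive (assumed convention)."""
    tp = fp = fn = 0
    for label, score in zip(y_true, y_score):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif not predicted_positive and label == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def select_max_f1_threshold(
    y_val: Sequence[int], score_val: Sequence[float]
) -> float:
    """Hypothetical selector: try each distinct validation score as a threshold
    and keep the one with the highest validation F1 (lowest threshold on ties)."""
    candidates = sorted(set(score_val))
    return max(candidates, key=lambda t: f1_at_threshold(y_val, score_val, t))


# Step 1: the threshold is chosen from validation data only.
y_val = [0, 0, 1, 1, 0, 1]
score_val = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]
threshold = select_max_f1_threshold(y_val, score_val)

# Step 2: the test metric is computed at that fixed threshold.
y_test = [1, 0, 1, 0]
score_test = [0.9, 0.3, 0.5, 0.6]
test_f1 = f1_at_threshold(y_test, score_test, threshold)
```

The test-set F1 depends on where the threshold came from, which is information a threshold-free spec has no place to record. That is the gap the operating-point helpers fill.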
## Migration recipes

### Scalar PR-AUC → scorecard

**Before:**

```python
from eval_toolkit import pr_auc

auc = pr_auc(y_true, y_score)
```

**After:**

```python
from eval_toolkit import scorecard, metric_specs as ms

auc = scorecard(y_true, y_score, metrics=[ms.pr_auc], bootstrap=False)["pr_auc"].value
```

Or, if you want to keep the scalar shape locally:

```python
from eval_toolkit.metrics import pr_auc  # internal API; no warning

auc = pr_auc(y_true, y_score)
```

### Bootstrap CI on a metric → scorecard with `bootstrap=True`

**Before:**

```python
from eval_toolkit import pr_auc, bootstrap_ci

auc = pr_auc(y, s)
ci = bootstrap_ci(y, s, pr_auc, n_resamples=1000, rng=0)
```

**After:**

```python
from eval_toolkit import scorecard, metric_specs as ms

r = scorecard(y, s, metrics=[ms.pr_auc], bootstrap=True, n_resamples=1000, rng=0)
auc = r["pr_auc"].value
ci = r["pr_auc"].ci
```

### Multiple metrics at once → batch them in `scorecard`

**Before:**

```python
from eval_toolkit import pr_auc, roc_auc, brier_score, expected_calibration_error

results = {
    "pr_auc": pr_auc(y, s),
    "roc_auc": roc_auc(y, s),
    "brier": brier_score(y, s),
    "ece": expected_calibration_error(y, s, n_bins=15),
}
```

**After:**

```python
from eval_toolkit import scorecard, metric_specs as ms

r = scorecard(y, s, metrics=[
    ms.pr_auc, ms.roc_auc, ms.brier, ms.ece(n_bins=15),
], bootstrap=False)

# Subscript access via stable string keys:
results = {name: r[name].value for name in r}
```

### Single-class slice safety

**Before:** PR-AUC on a single-class slice silently returned a degenerate value (`1.0` or `0.0` per sklearn defaults). Downstream artifacts contained misleading evidence.

**After:** `scorecard()` returns `MetricResult(status="skipped", value=None, reason="pr_auc not defined on single-class slice")`. The whole scorecard still computes; other metrics that ARE defined on single-class (Brier, ECE) still produce `status="ok"`.

```python
import numpy as np

from eval_toolkit import scorecard, metric_specs as ms

r = scorecard(np.zeros(100, dtype=int), np.random.random(100),
              metrics=[ms.pr_auc, ms.brier], bootstrap=False)

r["pr_auc"].status  # 'skipped'
r["pr_auc"].value   # None
r["brier"].status   # 'ok'
r["brier"].value    # roughly 0.33 (mean of squared uniform scores); varies per draw
```

### Custom user metrics (third-party specs)

`MetricSpec` is a structural Protocol: any class exposing `name: str` and `compute(y_true, y_score) -> float` satisfies it.

```python
from eval_toolkit import scorecard, MetricSpec

class _MyMetric:
    name = "my_metric"

    def compute(self, y_true, y_score):
        return float(...)  # placeholder: substitute your own computation

assert isinstance(_MyMetric(), MetricSpec)

r = scorecard(y, s, metrics=[_MyMetric()], bootstrap=True)
r["my_metric"].value  # whatever compute() returned
```

## Treating `DeprecationWarning` as an error in your CI

To catch every deprecated import in your test suite, set:

```bash
pytest -W error::DeprecationWarning ...
```

Or in `pyproject.toml`:

```toml
[tool.pytest.ini_options]
filterwarnings = [
    "error::DeprecationWarning:eval_toolkit",
]
```

That will surface every top-level scalar metric import as a CI failure, making the migration audit mechanical.

## v0.47 hard removal

At v0.47.0, the `__getattr__` deprecation branch is deleted. The 8 deprecated names raise `AttributeError` at the top level. The submodule path stays.

**Plan to be fully migrated off the top-level scalars before bumping the consumer pin to v0.47.**

## Open questions / future work

- **Operating-point spec family.** v0.46 ships threshold-free specs only. If users ask for F1 / accuracy / precision / recall through the scorecard surface, v1.x can add a separate `op_metric_specs` namespace with explicit threshold provenance + CI-policy contract. Deferred to v1.x per Decision R.
- **`MetricSpec` Protocol additions.** Tier-2 contract is frozen at v1.0 modulo additive subprotocols. Future enhancements (e.g., an `is_defined_on(y_true)` method to replace the centralized `is_metric_defined_for_slice` lookup) would land as a subprotocol.

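As an illustration of what "additive subprotocol" could mean in practice, the sketch below uses only `typing.Protocol` from the standard library. Everything in it is hypothetical: `MetricSpecLike` is a stand-in for the `name` / `compute` shape described in this guide, and `SliceAwareMetricSpec`, `MeanScore`, `NeedsBothClasses`, and `runnable_specs` are invented names. Nothing here is the toolkit's planned API.

```python
from typing import Protocol, Sequence, runtime_checkable


@runtime_checkable
class MetricSpecLike(Protocol):
    """Stand-in for the base contract: a name plus compute()."""

    name: str

    def compute(self, y_true: Sequence[int], y_score: Sequence[float]) -> float: ...


@runtime_checkable
class SliceAwareMetricSpec(MetricSpecLike, Protocol):
    """Hypothetical subprotocol: the base contract plus is_defined_on()."""

    def is_defined_on(self, y_true: Sequence[int]) -> bool: ...


class MeanScore:
    """Satisfies only the base protocol; makes no claim about slices."""

    name = "mean_score"

    def compute(self, y_true, y_score):
        return sum(y_score) / len(y_score)


class NeedsBothClasses:
    """Satisfies the subprotocol; declares itself undefined on one-class slices."""

    name = "needs_both_classes"

    def compute(self, y_true, y_score):
        return sum(y_score) / len(y_score)

    def is_defined_on(self, y_true):
        return len(set(y_true)) > 1


def runnable_specs(specs, y_true):
    """Keep specs that either do not implement the subprotocol or report
    being defined on this slice."""
    return [
        spec
        for spec in specs
        if not isinstance(spec, SliceAwareMetricSpec) or spec.is_defined_on(y_true)
    ]


specs = [MeanScore(), NeedsBothClasses()]
single_class = [spec.name for spec in runnable_specs(specs, [0, 0, 0])]
two_class = [spec.name for spec in runnable_specs(specs, [0, 1, 0])]
```

Because the extra method lives in a subprotocol, existing specs that expose only `name` and `compute` keep satisfying the base contract unchanged, which is what makes the addition non-breaking.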
## References

- [ADR 0002: scorecard as primary metric surface](../adr/0002-scorecard-as-primary-metric-surface.md)
- v1.0 plan: `~/.claude/plans/evaluate-all-the-work-twinkly-kite.md`
- [`docs/source/examples/scorecard.md`](../examples/): worked example (to be added)
- [Issue #36](https://github.com/brandon-behring/eval-toolkit/issues/36)