# Migrating to v0.46.0

eval-toolkit v0.46.0 introduces a new primary metric surface and **soft-deprecates** the top-level scalar metric imports. This guide covers the consumer-side changes.

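> v0.46 is **soft-breaking**: existing code using top-level scalar imports continues to work but emits `DeprecationWarning`. In v0.47, the deprecated names raise `AttributeError` on attribute access (and `from eval_toolkit import pr_auc` fails with `ImportError`, which is how Python reports a missing name in a `from` import). Use the v0.46 cycle to migrate.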

## TL;DR

| Before (v0.45 and earlier) | After (v0.46+) |
|---|---|
| `from eval_toolkit import pr_auc` | `from eval_toolkit import scorecard, metric_specs as ms` |
| `auc = pr_auc(y_true, y_score)` | `r = scorecard(y_true, y_score, metrics=[ms.pr_auc], bootstrap=False); auc = r["pr_auc"].value` |
| `from eval_toolkit import bootstrap_ci; ci = bootstrap_ci(y, s, pr_auc, n_resamples=1000)` | `r = scorecard(y, s, metrics=[ms.pr_auc], bootstrap=True, n_resamples=1000); auc, ci = r["pr_auc"].value, r["pr_auc"].ci` |

## What changed

### 1. New `scorecard()` primary metric surface

`scorecard()` is a single function that computes multiple threshold-free metrics, plus optional bootstrap CIs, on one slice. It returns a `Scorecard` (`Mapping[str, MetricResult]`):

```text
from eval_toolkit import scorecard, metric_specs as ms

r = scorecard(
    y_true, y_score,
    metrics=[ms.pr_auc, ms.roc_auc, ms.brier, ms.ece(n_bins=15)],
    bootstrap=True,
    n_resamples=1000,
    confidence=0.95,
    rng=0,
)

# Dict-subscript access (type-safe under mypy --strict):
r["pr_auc"].value       # 0.873
r["pr_auc"].status      # 'ok' | 'skipped' | 'error'
r["pr_auc"].ci          # BootstrapCI(low=0.84, high=0.90, ...)
r["pr_auc"].reason      # '' when ok; explanation when skipped/error
r.to_dict()             # JSON-friendly dict
r.to_pandas()           # one-row DataFrame (requires [dataframe] extra)
```

### 2. Soft-deprecated top-level scalar imports

These 8 names are no longer in `eval_toolkit.__all__` and emit `DeprecationWarning` on lookup. They will be hard-removed at v0.47.

- `pr_auc`
- `roc_auc`
- `brier_score`
- `expected_calibration_error`
- `expected_calibration_error_debiased`
- `expected_calibration_error_equal_mass`
- `expected_calibration_error_l2`
- `expected_calibration_error_l2_debiased`
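
For readers unfamiliar with how a package can warn on attribute lookup, the following is a minimal sketch of the general technique: a module-level `__getattr__` (PEP 562). It is not eval-toolkit's actual source. The module name `fake_toolkit`, the stub `_pr_auc`, and the warning text are all hypothetical.

```python
import types
import warnings

# Hypothetical stand-in for a scalar metric; not the real implementation.
def _pr_auc(y_true, y_score):
    return 0.5

_DEPRECATED = {"pr_auc": _pr_auc}

def _module_getattr(name):
    """Called only when normal module attribute lookup fails (PEP 562)."""
    if name in _DEPRECATED:
        warnings.warn(
            f"top-level {name!r} is deprecated; use scorecard() instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return _DEPRECATED[name]
    raise AttributeError(f"module 'fake_toolkit' has no attribute {name!r}")

# Build a throwaway module so the sketch runs without installing anything.
fake_toolkit = types.ModuleType("fake_toolkit")
fake_toolkit.__getattr__ = _module_getattr

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    fn = fake_toolkit.pr_auc  # triggers the deprecation branch

deprecation_seen = any(issubclass(w.category, DeprecationWarning) for w in caught)
```

In a sketch like this, deleting the `_DEPRECATED` branch leaves only the `AttributeError` path, which corresponds to the hard-removal behavior described for v0.47.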

### 3. Submodule path remains as the **internal-API escape hatch**

`from eval_toolkit.metrics import pr_auc` works at v0.46, v0.47, and v1.0+ — **without** deprecation warning. This is intended for:

- Power-user / Monte-Carlo inner loops where `scorecard()` orchestration cost is too high.
- Custom `MetricSpec` implementations that wrap the scalar function.

⚠️ The submodule path is documented as **internal API** per [ADR 0002](../adr/0002-scorecard-as-primary-metric-surface.md). It is **not part of the v1.0 strict stability contract** and may be refactored in major versions. Use `scorecard()` for code that needs the v1.0 stability promise.

### 4. Threshold-dependent metrics are NOT in `metric_specs`

`f1`, `accuracy`, `precision`, `recall` are intentionally absent from the v0.46 first-party spec namespace per [Decision R](../adr/0002-scorecard-as-primary-metric-surface.md) of the v1.0 plan. They need a threshold, and threshold provenance is its own concern. To compute them, use the existing operating-point machinery:

```text
from eval_toolkit import MaxF1Selector, metrics_at_threshold

# Step 1: select a threshold from val
selector_result = MaxF1Selector().select(y_val, score_val)

# Step 2: compute metrics at that threshold on test
m = metrics_at_threshold(y_test, score_test, threshold=selector_result.threshold)
m["f1"], m["precision"], m["recall"]
```

If your eval pipeline computes F1 / accuracy / precision / recall outside this operating-point flow (e.g., for paired bootstrap), see `paired_bootstrap_op_point_diff`, the threshold-aware paired-comparison helper.

## Migration recipes

### Scalar PR-AUC → scorecard

**Before:**
```text
from eval_toolkit import pr_auc
auc = pr_auc(y_true, y_score)
```

**After:**
```text
from eval_toolkit import scorecard, metric_specs as ms
auc = scorecard(y_true, y_score, metrics=[ms.pr_auc], bootstrap=False)["pr_auc"].value
```

Or, if you want to keep the scalar shape locally:

```text
from eval_toolkit.metrics import pr_auc  # internal API; no warning
auc = pr_auc(y_true, y_score)
```

### Bootstrap CI on a metric → scorecard with `bootstrap=True`

**Before:**
```text
from eval_toolkit import pr_auc, bootstrap_ci
auc = pr_auc(y, s)
ci = bootstrap_ci(y, s, pr_auc, n_resamples=1000, rng=0)
```

**After:**
```text
from eval_toolkit import scorecard, metric_specs as ms
r = scorecard(y, s, metrics=[ms.pr_auc], bootstrap=True, n_resamples=1000, rng=0)
auc = r["pr_auc"].value
ci = r["pr_auc"].ci
```
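
To make concrete what a bootstrap CI on a metric means, here is a generic percentile-bootstrap sketch in pure Python. It illustrates the general technique only: this guide does not specify whether `scorecard()` uses the percentile method, how it resamples, or how it treats degenerate resamples. `brier` and `percentile_bootstrap_ci` are hypothetical helper names.

```python
import random

def brier(y_true, y_score):
    """Mean squared difference between scores and binary labels."""
    return sum((s - y) ** 2 for y, s in zip(y_true, y_score)) / len(y_true)

def percentile_bootstrap_ci(y_true, y_score, metric, n_resamples=1000,
                            confidence=0.95, seed=0):
    """Resample (label, score) pairs with replacement and take percentiles."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_score[i] for i in idx]))
    stats.sort()
    alpha = (1.0 - confidence) / 2.0
    low = stats[int(alpha * (n_resamples - 1))]
    high = stats[int((1.0 - alpha) * (n_resamples - 1))]
    return low, high

# Synthetic data: scores near 0.8 for positives, near 0.2 for negatives.
data_rng = random.Random(42)
y = [data_rng.randint(0, 1) for _ in range(200)]
s = [min(1.0, max(0.0, 0.6 * yi + 0.2 + data_rng.uniform(-0.2, 0.2))) for yi in y]

point = brier(y, s)
low, high = percentile_bootstrap_ci(y, s, brier)
```

Fixing the seed makes the interval reproducible, which is the role the `rng=0` argument appears to play in the `scorecard()` call above.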

### Multiple metrics at once → batch them in `scorecard`

**Before:**
```text
from eval_toolkit import pr_auc, roc_auc, brier_score, expected_calibration_error
results = {
    "pr_auc": pr_auc(y, s),
    "roc_auc": roc_auc(y, s),
    "brier": brier_score(y, s),
    "ece": expected_calibration_error(y, s, n_bins=15),
}
```

**After:**
```text
from eval_toolkit import scorecard, metric_specs as ms
r = scorecard(y, s, metrics=[
    ms.pr_auc, ms.roc_auc, ms.brier, ms.ece(n_bins=15),
], bootstrap=False)
# Subscript access via stable string keys:
results = {name: r[name].value for name in r}
```

### Single-class slice safety

**Before:** PR-AUC on a single-class slice silently returned a degenerate value (`1.0` or `0.0` per sklearn defaults). Downstream artifacts contained misleading evidence.

**After:** `scorecard()` returns `MetricResult(status="skipped", value=None, reason="pr_auc not defined on single-class slice")`. The whole scorecard still computes; other metrics that are defined on single-class slices (Brier, ECE) still produce `status="ok"`.

```text
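import numpy as np
from eval_toolkit import scorecard, metric_specs as ms

# All labels are 0, so the slice is single-class.
r = scorecard(np.zeros(100, dtype=int), np.random.random(100),
              metrics=[ms.pr_auc, ms.brier], bootstrap=False)
r["pr_auc"].status   # 'skipped'
r["pr_auc"].value    # None
r["brier"].status    # 'ok'
r["brier"].value     # roughly 0.33 (mean of score**2 for uniform scores)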
```
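
Downstream code that previously assumed every metric was a float now needs to branch on `status`. A minimal sketch of that handling follows. `FakeResult` is a hypothetical stand-in that mirrors only the `value` / `status` / `reason` fields described above, so the snippet runs without eval-toolkit installed.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

@dataclass(frozen=True)
class FakeResult:
    """Hypothetical stand-in for the MetricResult fields named in this guide."""
    value: Optional[float]
    status: str  # 'ok' | 'skipped' | 'error'
    reason: str = ""

def split_results(card: Mapping[str, FakeResult]):
    """Separate usable values from metrics that were skipped or errored."""
    values = {k: r.value for k, r in card.items() if r.status == "ok"}
    problems = {k: r.reason for k, r in card.items() if r.status != "ok"}
    return values, problems

card = {
    "pr_auc": FakeResult(None, "skipped", "pr_auc not defined on single-class slice"),
    "brier": FakeResult(0.33, "ok"),
}
values, problems = split_results(card)
```

Note that a plain `{name: r[name].value for name in r}` comprehension would carry `None` through for skipped metrics, so any code that serializes or aggregates those values should filter on `status` first.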

### Custom user metrics (third-party specs)

`MetricSpec` is a structural Protocol — any class exposing `name: str` and `compute(y_true, y_score) -> float` satisfies it.

```text
from eval_toolkit import scorecard, MetricSpec

class _MyMetric:
    name = "my_metric"
    def compute(self, y_true, y_score):
        return float(...)

assert isinstance(_MyMetric(), MetricSpec)
r = scorecard(y, s, metrics=[_MyMetric()], bootstrap=True)
r["my_metric"].value   # whatever compute() returned
```
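
As a concrete illustration of structural typing, the sketch below defines a stand-in Protocol and a metric class that satisfies it without inheriting from it. `SpecLike` is hypothetical and mirrors only the two members named above. The `isinstance` check works here because the stand-in is decorated with `@runtime_checkable`; that the real `MetricSpec` is also runtime-checkable is an assumption based on the `assert` in the example above.

```python
from typing import Protocol, Sequence, runtime_checkable

@runtime_checkable
class SpecLike(Protocol):
    """Hypothetical mirror of the MetricSpec shape described in this guide."""
    name: str

    def compute(self, y_true: Sequence[int], y_score: Sequence[float]) -> float: ...

class MeanAbsError:
    """Example metric: mean |score - label|. No inheritance from SpecLike."""
    name = "mean_abs_error"

    def compute(self, y_true, y_score):
        return float(sum(abs(s - y) for y, s in zip(y_true, y_score)) / len(y_true))

class NotASpec:
    name = "missing_compute"  # has a name but no compute() method

spec = MeanAbsError()
result = spec.compute([0, 1, 1, 0], [0.25, 0.75, 0.5, 0.0])
```

A runtime `isinstance` check against a Protocol only verifies that the members exist; it does not validate signatures or return types, so a static type checker remains the stronger guarantee.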

## Treating `DeprecationWarning` as an error in your CI

To catch every deprecated import in your test suite, set:

```bash
pytest -W error::DeprecationWarning ...
```

Or in `pyproject.toml`:

```toml
[tool.pytest.ini_options]
filterwarnings = [
    "error::DeprecationWarning:eval_toolkit",
]
```

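That will surface every top-level scalar metric import that your test suite executes as a CI failure, making the migration audit mechanical. Code paths the tests never reach are not covered, so pair this with a text search for the 8 names if test coverage is partial.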
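
The same policy can be checked locally from plain Python, for example in a one-off audit script. The sketch below uses only the standard-library `warnings` module. `legacy_call` is a hypothetical function standing in for any code path that triggers a deprecated import.

```python
import warnings

def legacy_call():
    """Hypothetical stand-in for code that touches a deprecated name."""
    warnings.warn("top-level 'pr_auc' is deprecated", DeprecationWarning, stacklevel=2)
    return 0.5

def raises_on_deprecation(fn):
    """Return True if fn() emits a DeprecationWarning under an 'error' filter."""
    with warnings.catch_warnings():
        warnings.simplefilter("error", DeprecationWarning)
        try:
            fn()
        except DeprecationWarning:
            return True
    return False

flagged = raises_on_deprecation(legacy_call)
clean = raises_on_deprecation(lambda: 0.5)
```

Because `catch_warnings()` restores the previous filters on exit, the helper does not change warning behavior for the rest of the process.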

## v0.47 hard removal

At v0.47.0, the `__getattr__` deprecation branch is deleted. The 8 deprecated names raise `AttributeError` at the top level. The submodule path stays. **Plan to be fully migrated off the top-level scalars before bumping the consumer pin to v0.47.**

## Open questions / future work

- **Operating-point spec family.** v0.46 ships threshold-free specs only. If user demand emerges for F1 / accuracy / precision / recall via the scorecard surface, v1.x can add a separate `op_metric_specs` namespace with explicit threshold provenance + CI-policy contract. Deferred to v1.x per Decision R.
- **`MetricSpec` Protocol additions.** Tier-2 contract is frozen at v1.0 modulo additive subprotocols. Future enhancements (e.g., an `is_defined_on(y_true)` method to replace the centralized `is_metric_defined_for_slice` lookup) would land as a subprotocol.

## References

- [ADR 0002 — scorecard as primary metric surface](../adr/0002-scorecard-as-primary-metric-surface.md)
- v1.0 plan: `~/.claude/plans/evaluate-all-the-work-twinkly-kite.md`
- [`docs/source/examples/scorecard.md`](../examples/) — worked example (to be added)
- [Issue #36](https://github.com/brandon-behring/eval-toolkit/issues/36)