Migrating to v0.46.0#

eval-toolkit v0.46.0 introduces a new primary metric surface and soft-deprecates the top-level scalar metric imports. This guide covers the consumer-side changes.

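v0.46 is soft-breaking: existing code using top-level scalar imports continues to work but emits DeprecationWarning. In v0.47, the deprecated names are removed from the top level: attribute access such as eval_toolkit.pr_auc raises AttributeError, and from eval_toolkit import pr_auc raises ImportError, because Python converts a failed from-import lookup into ImportError. Use the v0.46 cycle to migrate.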

TL;DR#

| Before (v0.45 and earlier) | After (v0.46+) |
| --- | --- |
| `from eval_toolkit import pr_auc` | `from eval_toolkit import scorecard, metric_specs as ms` |
| `auc = pr_auc(y_true, y_score)` | `r = scorecard(y_true, y_score, metrics=[ms.pr_auc], bootstrap=False); auc = r["pr_auc"].value` |
| `from eval_toolkit import bootstrap_ci; ci = bootstrap_ci(y, s, pr_auc, n_resamples=1000)` | `r = scorecard(y, s, metrics=[ms.pr_auc], bootstrap=True, n_resamples=1000); auc, ci = r["pr_auc"].value, r["pr_auc"].ci` |

What changed#

1. New `scorecard()` primary metric surface#

scorecard() is a single function that computes multiple threshold-free metrics, optionally with bootstrap CIs, on one slice. It returns a Scorecard (a Mapping[str, MetricResult]):

from eval_toolkit import scorecard, metric_specs as ms

r = scorecard(
    y_true, y_score,
    metrics=[ms.pr_auc, ms.roc_auc, ms.brier, ms.ece(n_bins=15)],
    bootstrap=True,
    n_resamples=1000,
    confidence=0.95,
    rng=0,
)

# Dict-subscript access (type-safe under mypy --strict):
r["pr_auc"].value       # 0.873
r["pr_auc"].status      # 'ok' | 'skipped' | 'error'
r["pr_auc"].ci          # BootstrapCI(low=0.84, high=0.90, ...)
r["pr_auc"].reason      # '' when ok; explanation when skipped/error
r.to_dict()             # JSON-friendly dict
r.to_pandas()           # one-row DataFrame (requires [dataframe] extra)

2. Soft-deprecated top-level scalar imports#

These 8 names are no longer in eval_toolkit.__all__ and emit DeprecationWarning on lookup. They will be hard-removed at v0.47.

pr_auc
roc_auc
brier_score
expected_calibration_error
expected_calibration_error_debiased
expected_calibration_error_equal_mass
expected_calibration_error_l2
expected_calibration_error_l2_debiased
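
For readers unfamiliar with how a package can warn "on lookup", the following is a minimal sketch of the general mechanism, a module-level `__getattr__` hook (PEP 562). It is not eval-toolkit's actual code: `fake_toolkit`, `_DEPRECATED`, the dummy metric, and the warning text are all hypothetical stand-ins.

```python
import types
import warnings

# Hypothetical stand-in for a package; every name here is illustrative only.
_DEPRECATED = {"pr_auc": lambda y_true, y_score: 0.5}  # dummy scalar metric


def _module_getattr(name):
    """PEP 562 hook: runs only when normal module attribute lookup fails."""
    if name in _DEPRECATED:
        warnings.warn(
            f"top-level {name!r} is deprecated; use scorecard() instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return _DEPRECATED[name]
    raise AttributeError(f"module 'fake_toolkit' has no attribute {name!r}")


fake_toolkit = types.ModuleType("fake_toolkit")
fake_toolkit.__getattr__ = _module_getattr

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    metric = fake_toolkit.pr_auc        # the lookup itself triggers the warning
    value = metric([0, 1], [0.2, 0.9])  # the returned function still works

print(value, caught[0].category.__name__)
```

Because the warning fires at lookup time, a module that imports a deprecated name but never calls it still warns.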

3. Submodule path remains as the internal-API escape hatch#

from eval_toolkit.metrics import pr_auc works at v0.46, v0.47, and v1.0+ without a deprecation warning. This path is intended for:

Power-user / Monte-Carlo inner loops where scorecard() orchestration cost is too high.
Custom MetricSpec implementations that wrap the scalar function.

⚠️ The submodule path is documented as internal API per ADR 0002. It is not part of the v1.0 strict stability contract and may be refactored in major versions. Use scorecard() for code that needs the v1.0 stability promise.

4. Threshold-dependent metrics are NOT in `metric_specs`#

f1, accuracy, precision, and recall are intentionally absent from the v0.46 first-party spec namespace per Decision R of the v1.0 plan. Each needs a decision threshold, and where that threshold comes from (threshold provenance) is a separate concern. To compute them, use the existing operating-point machinery:

from eval_toolkit import MaxF1Selector, metrics_at_threshold

# Step 1: select a threshold from val
selector_result = MaxF1Selector().select(y_val, score_val)

# Step 2: compute metrics at that threshold on test
m = metrics_at_threshold(y_test, score_test, threshold=selector_result.threshold)
m["f1"], m["precision"], m["recall"]

If your eval pipeline calls F1 / accuracy / precision / recall on a separate threshold-free path (e.g., for paired bootstrap), see paired_bootstrap_op_point_diff for the threshold-aware paired-comparison helper.
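
To make the "select on val, evaluate on test" split concrete, here is a plain-Python sketch of the idea. It does not reproduce MaxF1Selector or metrics_at_threshold. The helper names, the `score >= threshold` convention, the candidate grid (the unique validation scores), and the tie-breaking rule are all assumptions made for illustration.

```python
def prf_at_threshold(y_true, y_score, threshold):
    """Precision, recall and F1 when predicting positive for score >= threshold."""
    tp = fp = fn = 0
    for y, s in zip(y_true, y_score):
        predicted_positive = s >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == 0:
            fp += 1
        elif not predicted_positive and y == 1:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def select_max_f1_threshold(y_val, score_val):
    """Pick the candidate threshold with the highest validation F1 (lowest wins ties)."""
    candidates = sorted(set(score_val))
    return max(candidates, key=lambda t: prf_at_threshold(y_val, score_val, t)["f1"])


# Step 1: choose the threshold on validation data only.
y_val = [0, 0, 1, 1, 1, 0]
score_val = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]
threshold = select_max_f1_threshold(y_val, score_val)

# Step 2: apply the frozen threshold to held-out test data.
y_test = [1, 0, 1, 0]
score_test = [0.9, 0.3, 0.2, 0.5]
m = prf_at_threshold(y_test, score_test, threshold)
```

Keeping the threshold fixed between the two steps is what keeps the test-set numbers free of selection bias.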

Migration recipes#

Scalar PR-AUC → scorecard#

Before:

from eval_toolkit import pr_auc
auc = pr_auc(y_true, y_score)

After:

from eval_toolkit import scorecard, metric_specs as ms
auc = scorecard(y_true, y_score, metrics=[ms.pr_auc], bootstrap=False)["pr_auc"].value

Or, if you want to keep the scalar shape locally:

from eval_toolkit.metrics import pr_auc  # internal API; no warning
auc = pr_auc(y_true, y_score)

Bootstrap CI on a metric → scorecard with `bootstrap=True`#

Before:

from eval_toolkit import pr_auc, bootstrap_ci
auc = pr_auc(y, s)
ci = bootstrap_ci(y, s, pr_auc, n_resamples=1000, rng=0)

After:

from eval_toolkit import scorecard, metric_specs as ms
r = scorecard(y, s, metrics=[ms.pr_auc], bootstrap=True, n_resamples=1000, rng=0)
auc = r["pr_auc"].value
ci = r["pr_auc"].ci

Multiple metrics at once → batch them in `scorecard`#

Before:

from eval_toolkit import pr_auc, roc_auc, brier_score, expected_calibration_error
results = {
    "pr_auc": pr_auc(y, s),
    "roc_auc": roc_auc(y, s),
    "brier": brier_score(y, s),
    "ece": expected_calibration_error(y, s, n_bins=15),
}

After:

from eval_toolkit import scorecard, metric_specs as ms
r = scorecard(y, s, metrics=[
    ms.pr_auc, ms.roc_auc, ms.brier, ms.ece(n_bins=15),
], bootstrap=False)
# Subscript access via stable string keys:
results = {name: r[name].value for name in r}

Single-class slice safety#

Before: PR-AUC on a single-class slice silently returned a degenerate value (1.0 or 0.0 per sklearn defaults). Downstream artifacts contained misleading evidence.

After: scorecard() returns MetricResult(status="skipped", value=None, reason="pr_auc not defined on single-class slice") for that metric. The rest of the scorecard still computes: metrics that are defined on single-class slices (Brier, ECE) still produce status="ok".

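import numpy as np
from eval_toolkit import scorecard, metric_specs as ms

rng = np.random.default_rng(0)          # seeded so the example is reproducible
y_true = np.zeros(100, dtype=int)       # single-class slice: all negatives
y_score = rng.random(100)               # uniform scores in [0, 1)
r = scorecard(y_true, y_score, metrics=[ms.pr_auc, ms.brier], bootstrap=False)
r["pr_auc"].status   # 'skipped'
r["pr_auc"].value    # None
r["brier"].status    # 'ok'
r["brier"].value     # roughly 0.33 (mean of squared uniform scores against all-zero labels)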
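
Code that used to treat every metric as a float now has to tolerate value=None. The sketch below shows one possible way to separate usable values from skipped or failed ones before writing artifacts. `split_results` is a hypothetical helper, and SimpleNamespace objects stand in for real results so the snippet runs without the library; only the status, value, and reason fields described above are assumed.

```python
from types import SimpleNamespace


def split_results(results):
    """Return ({name: value} for status 'ok', {name: explanation} for everything else)."""
    values, problems = {}, {}
    for name, res in results.items():
        if res.status == "ok":
            values[name] = res.value
        else:
            problems[name] = f"{res.status}: {res.reason}"
    return values, problems


# Stand-ins shaped like the results described in this section (not real library objects).
fake_scorecard = {
    "pr_auc": SimpleNamespace(
        status="skipped",
        value=None,
        reason="pr_auc not defined on single-class slice",
    ),
    "brier": SimpleNamespace(status="ok", value=0.33, reason=""),
}

values, problems = split_results(fake_scorecard)
```

Logging `problems` next to `values` keeps the skip reason in the artifact rather than silently dropping the metric.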

Custom user metrics (third-party specs)#

MetricSpec is a structural Protocol — any class exposing name: str and compute(y_true, y_score) -> float satisfies it.

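import numpy as np
from eval_toolkit import scorecard, MetricSpec

class _MyMetric:
    name = "my_metric"
    def compute(self, y_true, y_score):
        # Illustrative body (mean absolute error); replace with your own logic.
        return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_score))))

assert isinstance(_MyMetric(), MetricSpec)  # structural check, no inheritance needed
r = scorecard(y, s, metrics=[_MyMetric()], bootstrap=True)
r["my_metric"].value   # whatever compute() returned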
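
The structural check can be tried without the library. The sketch below defines a local runtime-checkable Protocol, `SpecLike`, that mirrors the shape described above; it is a hypothetical stand-in, not the real MetricSpec, and `MeanAbsoluteError` and `MissingName` are made up for the demonstration.

```python
from typing import Protocol, Sequence, runtime_checkable


@runtime_checkable
class SpecLike(Protocol):
    """Local stand-in mirroring the documented shape: a name plus compute()."""

    name: str

    def compute(self, y_true: Sequence[float], y_score: Sequence[float]) -> float:
        ...


class MeanAbsoluteError:
    """Satisfies SpecLike without inheriting from it."""

    name = "mae"

    def compute(self, y_true, y_score):
        return float(sum(abs(y - s) for y, s in zip(y_true, y_score)) / len(y_true))


class MissingName:
    """Has compute() but no name attribute, so the structural check fails."""

    def compute(self, y_true, y_score):
        return 0.0


spec = MeanAbsoluteError()
conforms = isinstance(spec, SpecLike)
rejected = isinstance(MissingName(), SpecLike)
mae = spec.compute([0, 1, 1, 0], [0.25, 0.75, 0.5, 0.0])
```

A runtime isinstance check against a Protocol only verifies that the members exist; it does not validate signatures or return types, so a static type checker remains the stronger guard.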

Treating `DeprecationWarning` as an error in your CI#

To catch every deprecated import in your test suite, set:

pytest -W error::DeprecationWarning ...

Or in pyproject.toml:

[tool.pytest.ini_options]
filterwarnings = [
    "error::DeprecationWarning:eval_toolkit",
]

That will surface every top-level scalar metric import as a CI failure, making the migration audit mechanical.
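
The effect of an error filter can be reproduced with the standard library alone. In this sketch, `deprecated_lookup` is a hypothetical stand-in for touching a deprecated name; the filter call is the in-process equivalent of the unscoped command-line form.

```python
import warnings


def deprecated_lookup():
    """Hypothetical stand-in for a deprecated top-level import."""
    warnings.warn("pr_auc is deprecated", DeprecationWarning, stacklevel=2)
    return 0.5


def run_strict():
    """Run the lookup with DeprecationWarning promoted to an exception."""
    with warnings.catch_warnings():
        warnings.simplefilter("error", DeprecationWarning)
        try:
            deprecated_lookup()
        except DeprecationWarning as exc:
            return f"failed: {exc}"
    return "passed"


outcome = run_strict()
```

One caveat on the module-scoped form: Python matches the module field against the module the warning is attributed to, which depends on the stacklevel the emitting code passes. If the scoped filter does not fire for a deprecated import, the unscoped `error::DeprecationWarning` form is the fallback.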

v0.47 hard removal#

At v0.47.0, the __getattr__ deprecation branch is deleted. The 8 deprecated names then raise AttributeError on top-level attribute access (eval_toolkit.pr_auc) and ImportError when imported with from eval_toolkit import pr_auc. The submodule path stays. Plan to be fully migrated off the top-level scalars before bumping the consumer pin to v0.47.
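
Before bumping the pin, a static scan can complement the warning-based audit. The following sketch uses the standard-library ast module to list top-level imports of the 8 deprecated names in a piece of source text. `find_deprecated_imports` is a hypothetical helper, and it only detects the from-import form, not attribute access such as eval_toolkit.pr_auc.

```python
import ast

DEPRECATED = {
    "pr_auc",
    "roc_auc",
    "brier_score",
    "expected_calibration_error",
    "expected_calibration_error_debiased",
    "expected_calibration_error_equal_mass",
    "expected_calibration_error_l2",
    "expected_calibration_error_l2_debiased",
}


def find_deprecated_imports(source, package="eval_toolkit"):
    """Return sorted (line, name) pairs for `from <package> import <deprecated>`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        is_top_level_from = (
            isinstance(node, ast.ImportFrom)
            and node.level == 0
            and node.module == package
        )
        if is_top_level_from:
            for alias in node.names:
                if alias.name in DEPRECATED:
                    hits.append((node.lineno, alias.name))
    return sorted(hits)


sample = (
    "from eval_toolkit import pr_auc, scorecard\n"
    "from eval_toolkit.metrics import roc_auc\n"
    "from eval_toolkit import brier_score as bs\n"
)
hits = find_deprecated_imports(sample)
```

The submodule import on the second sample line is deliberately not flagged, since that path remains available.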

Open questions / future work#

Operating-point spec family. v0.46 ships threshold-free specs only. If users ask for F1 / accuracy / precision / recall through the scorecard surface, v1.x can add a separate op_metric_specs namespace with explicit threshold provenance and a CI-policy contract. Deferred to v1.x per Decision R.
MetricSpec Protocol additions. The Tier-2 contract is frozen at v1.0 modulo additive subprotocols. Future enhancements (e.g., an is_defined_on(y_true) method to replace the centralized is_metric_defined_for_slice lookup) would land as a subprotocol.

References#

ADR 0002 — scorecard as primary metric surface
v1.0 plan: ~/.claude/plans/evaluate-all-the-work-twinkly-kite.md
docs/source/examples/scorecard.md — worked example (to be added)
Issue #36
