# ADR 0002: `scorecard()` as the primary v1.0 metric surface **Status:** Accepted **Date:** 2026-05-21 **Deciders:** Brandon Behring (author), Round 5 audit (Codex + Gemini methodology cross-review) **Supersedes:** N/A. **Superseded by:** N/A. ## Context eval-toolkit v0.x exposed per-metric functions (`pr_auc`, `roc_auc`, `brier_score`, the ECE family) as top-level imports. A typical eval pipeline imported each one separately, computed a scalar, and called `bootstrap_ci` separately to get a confidence interval. Issue [#36](https://github.com/brandon-behring/eval-toolkit/issues/36) asked for an inline-CI surface on every metric (Inspect-AI / lm-eval scorecard pattern). The single downstream consumer (`prompt-injection-detection-submission`) filed two API sketches: a per-metric `with_ci=True` kwarg vs. a top-level `scorecard()` factory. Round 5 audit (Codex) surfaced a deeper concern about both sketches: threshold-dependent metrics (F1, accuracy, precision, recall) cannot live in a `(y_true, y_score) -> float` contract without baking in a default threshold (against documented methodology) or hiding threshold provenance inside a spec. The v1.0 release window is the cheapest moment to choose ONE primary surface; post-v1.0 the contract is frozen (Decision M tiered stability — top-level `__all__` strict, only additive subprotocols). ## Decision At v0.46.0, ship a single primary metric surface: ```python from eval_toolkit import scorecard, metric_specs as ms r = scorecard(y_true, y_score, metrics=[ms.pr_auc, ms.brier, ms.ece(n_bins=15)], bootstrap=True, n_resamples=1000, confidence=0.95) r["pr_auc"].value # 0.873 r["pr_auc"].status # 'ok' | 'skipped' | 'error' r["pr_auc"].ci # BootstrapCI(...) ``` Specifically: 1. **`scorecard()` is the public entry point.** Returns a `Scorecard` (`Mapping[str, MetricResult]`). Dict-subscript access by spec name (Decision I — type-safe under `mypy --strict`; typos raise `KeyError` instead of returning `Any`). 2. **`metric_specs` namespace ships threshold-free specs only** (Decision R). `pr_auc` / `roc_auc` / `brier` as singletons; `ece(n_bins, strategy)` as a LRU-cached factory. F1 / accuracy / precision / recall remain reachable via the existing `metrics_at_threshold` + `ThresholdSelector` machinery — they need a threshold, and threshold provenance is its own concern. 3. **`MetricResult` carries status-aware cells** (Decision S). Mirrors the existing `MetricState` vocabulary from `artifacts.py:30-61`: `ok` / `skipped` / `error`, plus a human-readable `reason` and optional `BootstrapCI`. Single-class slices produce `status="skipped"` rather than raising; per-cell exceptions become `status="error"` without aborting the whole scorecard. 4. **Skipped detection reuses the v0.39 primitive** (Decision X.2 + precondition): `is_metric_defined_for_slice(name, is_single_class=...)`. The primitive was extended at v0.46-prep to recognize both `pr_auc` / `roc_auc` (used by the new scorecard) and `auroc` / `auprc` (used by the v0.39 harness). No new public exception type; no new Protocol method. 5. **`MetricSpec` is a Tier-2 Protocol** per Decision M — frozen at v1.0 modulo additive subprotocols. Method-signature changes (e.g., adding a threshold parameter to `compute()`) would require a SemVer-major bump. 6. **Top-level scalar metric imports are soft-deprecated at v0.46** and hard-removed at v0.47 (Decision L). The 8 deprecated names (`pr_auc`, `roc_auc`, `brier_score`, 5 ECE variants) remain reachable at v0.46 via a transitional `__getattr__` branch that emits `DeprecationWarning`; at v0.47 the branch is deleted (the names raise `AttributeError`). 7. **The `eval_toolkit.metrics` submodule path is the internal-API escape hatch.** `from eval_toolkit.metrics import pr_auc` works at v0.46, v0.47, v1.0+ without warning. **This is documented as internal API, subject to refactor in major versions** — not part of the v1.0 strict stability contract. ## Consequences **Positive:** - One canonical metric pipeline. Users learn `scorecard()` once; it covers every reasonable threshold-free metric + a uniform CI story. - Status-aware cells eliminate a class of evaluation-pipeline bugs where degenerate single-class slices silently produced `1.0` PR-AUC values that entered downstream artifacts. - Custom user specs (any object satisfying `MetricSpec` structurally) compose alongside first-party specs without registration. The Protocol contract is small (`name`, `compute()`) and stable. - The submodule escape hatch (`from eval_toolkit.metrics import pr_auc`) keeps power-user / Monte-Carlo inner-loop patterns viable without paying the `scorecard()` orchestration cost. **Negative:** - Consumer pays a migration cost: every top-level scalar import either rewrites to `scorecard(...)` or moves to the submodule path. v0.46's DeprecationWarning shim lets this happen gradually across v0.46-v0.47; hard removal at v0.47 forces completion. - Threshold-dependent metrics (F1, accuracy, precision, recall) are intentionally NOT in `metric_specs` at v0.46. Users wanting "give me F1 with a bootstrap CI" must combine `MaxF1Selector` + `metrics_at_threshold` + a bootstrap loop themselves. Future work (deferred to v1.x) may add an operating-point spec family with explicit threshold provenance. - `Scorecard.to_pandas()` requires the `[dataframe]` extra; the core `to_dict()` is stdlib-only. ## Alternatives considered ### `with_ci=True` kwarg on every per-metric function (consumer sketch #1) Each metric function takes `with_ci: bool = False`; when `True`, returns `(value, BootstrapCI)`. Rejected because: - Doesn't address the threshold-dependent metrics gap (F1 etc. still need a threshold argument). - Keeps the per-metric return-shape sprawl as v1.0 contract (sometimes `float`, sometimes `(float, BootstrapCI)`, sometimes `dict[str, float]` — the audit's #2 finding from the v0.43 design). - Forces every scalar metric function to grow a `bootstrap` kwarg path — signature noise across ~10 functions vs. one `scorecard()` orchestrator. ### `scorecard()` + per-metric `with_ci=True` (Decision G — Round 1 Q1 option C) Both surfaces public + maintained. Rejected because Daisy's "no legacy design at v1.0" stance argued against two-surfaces-for-one-job. ### Operating-point spec family alongside threshold-free specs A separate `op_metric_specs` namespace with explicit `threshold` / `selector` / `fitted_op` provenance + explicit CI-policy contract (fix-vs-refit per resample). Rejected for v0.46 as scope-creep; deferred to v1.x if user demand surfaces. The methodology infrastructure is already in place (`metrics_at_threshold`, `ThresholdSelector`, `OperatingPointSpec`, `FittedOperatingPoint`) so adding the spec family is non-breaking when the need is clear. ## Trigger to revisit This decision is locked at v1.0 modulo Protocol-additive evolutions. Reopening would require **SemVer-major (v2.0)**. Concrete triggers: - **Second production consumer with materially different metric-surface needs.** The single-consumer / breaking-OK rationale that made this v1.0 contract cheap stops holding. Re-litigate the contract during v2.0 prep. - **Threshold-dependent metric demand from the consumer** surfaces as a release-blocker. Add an operating-point spec family (additive at v1.x); consider lifting `MetricSpec.compute()` to a richer signature at v2.0. - **Performance pressure on the orchestration cost.** `scorecard()` adds a small per-call constant overhead vs. raw scalar functions. Acceptable for offline eval pipelines; a Monte-Carlo inner loop calling `scorecard()` millions of times would suffer. The submodule escape hatch documents the workaround; if the gap becomes user-visible, consider a fast-path. ## References - v1.0 plan: `~/.claude/plans/evaluate-all-the-work-twinkly-kite.md` (Decisions A / B / C / I / J / L / M / R / S / X / Y / Z). - Round 5 audit findings: `docs/source/audit_findings.md` (F1 / F2 / F4 drove this ADR). - Issue [#36](https://github.com/brandon-behring/eval-toolkit/issues/36) — original consumer ask (inline bootstrap CI). - `src/eval_toolkit/scorecards.py` — implementation (renamed from `_scorecard.py` at v0.49.0; see ADR 0004 + STYLE.md §3d for the asymmetric-promotion principle). - `src/eval_toolkit/metric_specs.py` — first-party spec namespace. - `src/eval_toolkit/artifacts.py:30-61` — `MetricState` vocabulary reused by `MetricResult`.