ADR 0002: scorecard() as the primary v1.0 metric surface#
Status: Accepted Date: 2026-05-21 Deciders: Brandon Behring (author), Round 5 audit (Codex + Gemini methodology cross-review) Supersedes: N/A. Superseded by: N/A.
Context#
eval-toolkit v0.x exposed per-metric functions (pr_auc, roc_auc, brier_score,
the ECE family) as top-level imports. A typical eval pipeline imported each one
separately, computed a scalar, and called bootstrap_ci separately to get a
confidence interval. Issue #36
asked for an inline-CI surface on every metric (Inspect-AI / lm-eval scorecard
pattern). The single downstream consumer
(prompt-injection-detection-submission) filed two API sketches: a per-metric
with_ci=True kwarg vs. a top-level scorecard() factory.
Round 5 audit (Codex) surfaced a deeper concern about both sketches:
threshold-dependent metrics (F1, accuracy, precision, recall) cannot live in
a (y_true, y_score) -> float contract without baking in a default threshold
(against documented methodology) or hiding threshold provenance inside a spec.
The v1.0 release window is the cheapest moment to choose ONE primary surface;
post-v1.0 the contract is frozen (Decision M tiered stability — top-level
__all__ strict, only additive subprotocols).
Decision#
At v0.46.0, ship a single primary metric surface:
from eval_toolkit import scorecard, metric_specs as ms
r = scorecard(y_true, y_score,
metrics=[ms.pr_auc, ms.brier, ms.ece(n_bins=15)],
bootstrap=True, n_resamples=1000, confidence=0.95)
r["pr_auc"].value # 0.873
r["pr_auc"].status # 'ok' | 'skipped' | 'error'
r["pr_auc"].ci # BootstrapCI(...)
Specifically:
scorecard()is the public entry point. Returns aScorecard(Mapping[str, MetricResult]). Dict-subscript access by spec name (Decision I — type-safe undermypy --strict; typos raiseKeyErrorinstead of returningAny).metric_specsnamespace ships threshold-free specs only (Decision R).pr_auc/roc_auc/brieras singletons;ece(n_bins, strategy)as a LRU-cached factory. F1 / accuracy / precision / recall remain reachable via the existingmetrics_at_threshold+ThresholdSelectormachinery — they need a threshold, and threshold provenance is its own concern.MetricResultcarries status-aware cells (Decision S). Mirrors the existingMetricStatevocabulary fromartifacts.py:30-61:ok/skipped/error, plus a human-readablereasonand optionalBootstrapCI. Single-class slices producestatus="skipped"rather than raising; per-cell exceptions becomestatus="error"without aborting the whole scorecard.Skipped detection reuses the v0.39 primitive (Decision X.2 + precondition):
is_metric_defined_for_slice(name, is_single_class=...). The primitive was extended at v0.46-prep to recognize bothpr_auc/roc_auc(used by the new scorecard) andauroc/auprc(used by the v0.39 harness). No new public exception type; no new Protocol method.MetricSpecis a Tier-2 Protocol per Decision M — frozen at v1.0 modulo additive subprotocols. Method-signature changes (e.g., adding a threshold parameter tocompute()) would require a SemVer-major bump.Top-level scalar metric imports are soft-deprecated at v0.46 and hard-removed at v0.47 (Decision L). The 8 deprecated names (
pr_auc,roc_auc,brier_score, 5 ECE variants) remain reachable at v0.46 via a transitional__getattr__branch that emitsDeprecationWarning; at v0.47 the branch is deleted (the names raiseAttributeError).The
eval_toolkit.metricssubmodule path is the internal-API escape hatch.from eval_toolkit.metrics import pr_aucworks at v0.46, v0.47, v1.0+ without warning. This is documented as internal API, subject to refactor in major versions — not part of the v1.0 strict stability contract.
Consequences#
Positive:
One canonical metric pipeline. Users learn
scorecard()once; it covers every reasonable threshold-free metric + a uniform CI story.Status-aware cells eliminate a class of evaluation-pipeline bugs where degenerate single-class slices silently produced
1.0PR-AUC values that entered downstream artifacts.Custom user specs (any object satisfying
MetricSpecstructurally) compose alongside first-party specs without registration. The Protocol contract is small (name,compute()) and stable.The submodule escape hatch (
from eval_toolkit.metrics import pr_auc) keeps power-user / Monte-Carlo inner-loop patterns viable without paying thescorecard()orchestration cost.
Negative:
Consumer pays a migration cost: every top-level scalar import either rewrites to
scorecard(...)or moves to the submodule path. v0.46’s DeprecationWarning shim lets this happen gradually across v0.46-v0.47; hard removal at v0.47 forces completion.Threshold-dependent metrics (F1, accuracy, precision, recall) are intentionally NOT in
metric_specsat v0.46. Users wanting “give me F1 with a bootstrap CI” must combineMaxF1Selector+metrics_at_thresholda bootstrap loop themselves. Future work (deferred to v1.x) may add an operating-point spec family with explicit threshold provenance.
Scorecard.to_pandas()requires the[dataframe]extra; the coreto_dict()is stdlib-only.
Alternatives considered#
with_ci=True kwarg on every per-metric function (consumer sketch #1)#
Each metric function takes with_ci: bool = False; when True, returns
(value, BootstrapCI). Rejected because:
Doesn’t address the threshold-dependent metrics gap (F1 etc. still need a threshold argument).
Keeps the per-metric return-shape sprawl as v1.0 contract (sometimes
float, sometimes(float, BootstrapCI), sometimesdict[str, float]— the audit’s #2 finding from the v0.43 design).Forces every scalar metric function to grow a
bootstrapkwarg path — signature noise across ~10 functions vs. onescorecard()orchestrator.
scorecard() + per-metric with_ci=True (Decision G — Round 1 Q1 option C)#
Both surfaces public + maintained. Rejected because Daisy’s “no legacy design at v1.0” stance argued against two-surfaces-for-one-job.
Operating-point spec family alongside threshold-free specs#
A separate op_metric_specs namespace with explicit threshold /
selector / fitted_op provenance + explicit CI-policy contract
(fix-vs-refit per resample). Rejected for v0.46 as scope-creep; deferred to
v1.x if user demand surfaces. The methodology infrastructure is already in
place (metrics_at_threshold, ThresholdSelector, OperatingPointSpec,
FittedOperatingPoint) so adding the spec family is non-breaking when the
need is clear.
Trigger to revisit#
This decision is locked at v1.0 modulo Protocol-additive evolutions. Reopening would require SemVer-major (v2.0). Concrete triggers:
Second production consumer with materially different metric-surface needs. The single-consumer / breaking-OK rationale that made this v1.0 contract cheap stops holding. Re-litigate the contract during v2.0 prep.
Threshold-dependent metric demand from the consumer surfaces as a release-blocker. Add an operating-point spec family (additive at v1.x); consider lifting
MetricSpec.compute()to a richer signature at v2.0.Performance pressure on the orchestration cost.
scorecard()adds a small per-call constant overhead vs. raw scalar functions. Acceptable for offline eval pipelines; a Monte-Carlo inner loop callingscorecard()millions of times would suffer. The submodule escape hatch documents the workaround; if the gap becomes user-visible, consider a fast-path.
References#
v1.0 plan:
~/.claude/plans/evaluate-all-the-work-twinkly-kite.md(Decisions A / B / C / I / J / L / M / R / S / X / Y / Z).Round 5 audit findings:
docs/source/audit_findings.md(F1 / F2 / F4 drove this ADR).Issue #36 — original consumer ask (inline bootstrap CI).
src/eval_toolkit/scorecards.py— implementation (renamed from_scorecard.pyat v0.49.0; see ADR 0004 + STYLE.md §3d for the asymmetric-promotion principle).src/eval_toolkit/metric_specs.py— first-party spec namespace.src/eval_toolkit/artifacts.py:30-61—MetricStatevocabulary reused byMetricResult.