Library-first carryforward refactor v1.2.2 — consume 7 closed eval-toolkit primitives across 6 sites
ADR-066 — Library-first carryforward refactor v1.2.2
Status
Accepted (2026-05-19; additive layer on the v1.2.1 polish — no supersession of any prior ADR; this ADR consumes upstream primitives that resolve 7 previously-filed eval-toolkit issues, replacing 6 local workaround sites with library-first calls).
§A Context
After v1.2.1 close, a survey of deferred work uncovered that all 7 eval-toolkit upstream issues previously documented as “filed; awaiting upstream” in decisions/upstream_issues.md have CLOSED upstream:
| Upstream issue | Title | Status | Upstream resolution |
|---|---|---|---|
| #14 | Add plot_roc_curve (sibling to plot_pr_curve) | CLOSED | available in eval_toolkit.plot_roc_curve |
| #15 | Add plot_pareto_frontier for cost-vs-performance scatter | CLOSED | available in eval_toolkit.plot_pareto_frontier |
| #16 | Add plot_slice_metric_heatmap for (group × group × metric) grids | CLOSED | available in eval_toolkit.plot_slice_metric_heatmap |
| #17 | Add n_jobs to paired_bootstrap_diff | CLOSED | paired_bootstrap_diff(..., n_jobs=N) |
| #20 | Generalize mde_from_ci to accept BootstrapCI \| PairedBootstrapCI | CLOSED | et.mde_from_ci accepts both |
| #21 | Add block_bootstrap_on_folds (CV-aware block bootstrap) | CLOSED | et.block_bootstrap_on_folds |
| #22 | Add ax= kwarg to plot_metric_bars for shared-axes composition | CLOSED | plot_metric_bars(..., ax=axes) |
Verification (via `uv run python -c "import eval_toolkit as et; [hasattr(et, n) for n in ('plot_roc_curve', 'plot_pareto_frontier', 'plot_slice_metric_heatmap', 'plot_metric_bars', 'block_bootstrap_on_folds', 'mde_from_ci', 'paired_bootstrap_diff')]"`): all 7 attributes resolve in the currently-pinned eval-toolkit==0.42.0. No version bump required.
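The verification step can also be written as a small script. Note that `python -c` does not echo the value of a bare expression, so the one-liner as quoted builds its list without printing it; a helper that returns the missing names may be easier to audit. The sketch below is illustrative only: `missing_attributes` and `REQUIRED_PRIMITIVES` are hypothetical names, not part of eval-toolkit or this repository.

```python
import importlib

# The seven primitive names listed in the verification step.
REQUIRED_PRIMITIVES = (
    "plot_roc_curve",
    "plot_pareto_frontier",
    "plot_slice_metric_heatmap",
    "plot_metric_bars",
    "block_bootstrap_on_folds",
    "mde_from_ci",
    "paired_bootstrap_diff",
)


def missing_attributes(module_name, names):
    """Return the names that the imported module does not expose.

    Hypothetical helper: it only checks that each attribute exists, not that
    its signature matches what the refactor expects.
    """
    module = importlib.import_module(module_name)
    return [name for name in names if not hasattr(module, name)]
```

In an environment where the pinned package is installed, `missing_attributes("eval_toolkit", REQUIRED_PRIMITIVES)` would be expected to return an empty list; a non-empty result names the primitives that do not resolve.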
Per the strengthened library-first invariant (memory library_first_is_project_wide_invariant, 2026-05-18: “NO local workarounds whatsoever. Primitives belong in eval-toolkit / runpod-deploy / research_toolkit as PyPI deps; missing → upstream MR BLOCKS dependent work”), the 6 local workaround sites that cite these now-closed issues are accumulated maintenance debt. ADR-066 specifies the carryforward refactor pattern + the per-site consumption mapping.
The pattern follows three precedents: ADR-047 (Phase 1 src/data/ carryforward), ADR-056 (v1.0.8 binary-calibrator family consumption), and ADR-058 (v1.0.9 eval_from_hub.py wiring).
§B Decision — per-site refactor mapping
Each of the 6 sites replaces its inline workaround with an upstream-primitive call. Project glue (orchestrator wrappers, schema models, data loaders, project-specific overlay logic) STAYS — only the numerical + plotting impls are replaced.
B1 — F1 Pareto frontier (src/eval/figures.py::render_f1_pareto)
- Before: hand-rolled `matplotlib` scatter + frontier polyline (project glue calling `matplotlib.pyplot.scatter` + a custom `_compute_pareto_frontier` helper)
- After: `et.plot_pareto_frontier(costs, perfs, ax=axes, ...)` per upstream #15 closure
- Project glue retained: `render_f1_pareto` wrapper (data loader from canonical bootstrap parquets; figure_id provenance via `et.save_figure`; ADR-062 caption discipline)
- Deleted in same commit: `_compute_pareto_frontier` local helper + unused `numpy` import if no longer used
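For readers unfamiliar with what a helper like `_compute_pareto_frontier` does, here is a minimal sketch of a cost-vs-performance frontier. It is not the deleted helper and not upstream's implementation, neither of which is reproduced in this ADR. The function name `pareto_frontier` is hypothetical, and the convention that lower cost and higher performance are better is an assumption.

```python
def pareto_frontier(costs, perfs):
    """Indices of non-dominated points, ordered by increasing cost.

    Assumption: lower cost is better and higher performance is better.
    """
    if len(costs) != len(perfs):
        raise ValueError("costs and perfs must have the same length")
    # Sort by cost ascending; break cost ties by performance descending so the
    # best point at a given cost is visited first.
    order = sorted(range(len(costs)), key=lambda i: (costs[i], -perfs[i]))
    frontier = []
    best_perf = float("-inf")
    for i in order:
        # A point joins the frontier only if it strictly beats every point
        # that costs the same or less.
        if perfs[i] > best_perf:
            frontier.append(i)
            best_perf = perfs[i]
    return frontier
```

Connecting the returned indices in order gives the frontier polyline. Replacing this kind of logic with the upstream call removes the need to maintain tie-breaking rules locally.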
B2 — F2 ROC overlay (src/eval/figures.py::render_f2_roc_per_rung)
- Before: hand-rolled ROC overlay (per-rung `plot` calls with custom AUROC text labels)
- After: `et.plot_roc_curve(y_true, y_score, label=..., ax=axes)` per upstream #14 closure; one call per rung onto a shared axes
- Project glue retained: per-rung loop + project-specific label formatting + ADR-064 §B4 xlabel discipline
- Deleted in same commit: hand-rolled AUROC-text positioning logic if upstream’s default placement matches
B3 — F5 Per-slice heatmap (src/eval/figures.py::render_f5_slice_heatmap)
- Before: hand-rolled `matplotlib` imshow + tick formatting + cell-value annotation
- After: `et.plot_slice_metric_heatmap(values, x_labels, y_labels, ax=axes, ...)` per upstream #16 closure
- Project glue retained: N/A overlay on single-class slices per ADR-006 + ADR-062 (single-class slices show `N/A` rather than 0; the overlay is project-shaped since the slice-name → single-class membership mapping is project-specific)
- Deleted in same commit: hand-rolled imshow + tick-label formatting
B4 — F6 LODO breakdown left panel (src/eval/figures.py::render_f6_lodo_breakdown)
- Before: bare-matplotlib bars (manual `ax.bar(...)` calls) in the left panel because `plot_metric_bars` lacked an `ax=` kwarg
- After: `et.plot_metric_bars(metrics, labels, ax=left_ax)` per upstream #22 closure
- Project glue retained: right-panel reachability asterisks per ADR-025; A-009 audit-flag overlay; subplot composition (left = bars, right = reachability)
- Deleted in same commit: bare-matplotlib bar code in the left panel
B5 — MDE on marginal CIs (src/eval/mde.py::mde_from_marginal_ci_record)
- Before: inline closed-form `MDE = z_alpha * (CI_halfwidth / z_beta_minus_one)` workaround (per ADR-006 mandate that EVERY CI gets an MDE; upstream `mde_from_ci` only accepted PairedBootstrapCI)
- After: `et.mde_from_ci(bootstrap_ci_record)` per upstream #20 closure (generalized to accept `BootstrapCI | PairedBootstrapCI`)
- Project glue retained: `MDECellModel` schema wrapper (cell_id + fold + seed + provenance fields wrapping the numerical MDE value + CI half-width)
- Deleted in same commit: inline closed-form computation + fallback note in `src/eval/schemas.py:336`
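As background on how an MDE can be derived from a CI half-width, the sketch below uses the textbook normal-approximation form: recover the standard error from the half-width, then scale by the sum of the significance and power quantiles. This is an assumption about intent. It is neither a transcription of the deleted inline expression nor a description of what upstream's `mde_from_ci` computes, and it treats the CI as symmetric, which a bootstrap percentile interval need not be. The function name and defaults are hypothetical.

```python
from statistics import NormalDist


def mde_from_halfwidth(ci_halfwidth, confidence=0.95, alpha=0.05, power=0.80):
    """Normal-approximation MDE from a two-sided CI half-width.

    Assumptions (not confirmed by the ADR): the CI is symmetric and was built
    as estimate +/- z * SE at the given confidence level.
    """
    if ci_halfwidth < 0:
        raise ValueError("ci_halfwidth must be non-negative")
    std_normal = NormalDist()
    # Recover the standard error implied by the CI half-width.
    z_ci = std_normal.inv_cdf(1 - (1 - confidence) / 2)
    standard_error = ci_halfwidth / z_ci
    # Two-sided test at level alpha, detected with the requested power.
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_power = std_normal.inv_cdf(power)
    return (z_alpha + z_power) * standard_error
```

Under these defaults the MDE is roughly 1.43 times the 95% half-width. Delegating to the upstream primitive keeps that convention in one place instead of at each call site.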
B6 — Block-bootstrap on folds (src/eval/cross_fold_ci.py::compute_cross_fold_ci_cell)
- Before: inline block-bootstrap-on-folds impl (manual block-resampling + percentile CI) used as the “spoke” complement to `cv_clt_ci` per A-008 non-exchangeability auto-flag
- After: `et.block_bootstrap_on_folds(per_fold_metrics, n_bootstrap=...)` per upstream #21 closure
- Project glue retained: `CrossFoldCICellModel` schema; auto-flag column logic (`block_bootstrap_CI_halfwidth / cv_clt_CI_halfwidth > 1.5` → “LODO non-exchangeability dominates within-fold variance” flag per A-008)
- Deleted in same commit: inline block-bootstrap helper function
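The two pieces named above (block resampling with a percentile CI, and the half-width ratio flag) can be sketched as follows. This is a simplified illustration, not the deleted helper or upstream's `block_bootstrap_on_folds`: it assumes one scalar metric per fold, treats each fold as one block, and uses hypothetical function names and defaults. Only the 1.5 threshold comes from this ADR.

```python
import random
from statistics import mean


def fold_bootstrap_ci(per_fold_metrics, n_bootstrap=2000, confidence=0.95, seed=0):
    """Percentile CI for the mean metric, resampling whole folds as blocks.

    Assumption: per_fold_metrics holds one scalar per fold.
    """
    if not per_fold_metrics:
        raise ValueError("need at least one fold")
    rng = random.Random(seed)
    n_folds = len(per_fold_metrics)
    # Each replicate draws n_folds folds with replacement and averages them.
    replicate_means = sorted(
        mean(rng.choices(per_fold_metrics, k=n_folds)) for _ in range(n_bootstrap)
    )
    # Floor-based rank indices for the lower and upper percentiles.
    tail = (1 - confidence) / 2
    lo_idx = int(tail * (n_bootstrap - 1))
    hi_idx = int((1 - tail) * (n_bootstrap - 1))
    return replicate_means[lo_idx], replicate_means[hi_idx]


def non_exchangeability_flag(block_halfwidth, clt_halfwidth, threshold=1.5):
    """True when the block-bootstrap half-width exceeds threshold x the CLT one."""
    if clt_halfwidth <= 0:
        raise ValueError("clt_halfwidth must be positive")
    return block_halfwidth / clt_halfwidth > threshold
```

The flag function corresponds to the project glue that stays. Only the resampling half is what the upstream primitive replaces.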
§C Decision — figure re-render + visual-parity discipline
After Commits 3-6 land the 6 source-code refactors, Commit 7 re-renders F1-F6 via make render-figures. Per Q1 round-8 lock + ADR-062 + ADR-064 §B4:
Visual-parity threshold: spirit-of-original; cosmetic upstream improvements welcomed.
- Required: ADR-locked caption discipline preserved (random-floor annotation on F1 per ADR-062; F2 xlabel `'LoRA AUPRC minus frozen-probe AUPRC (95% CI; whiskers crossing 0 = indistinguishable)'` per ADR-064 §B4; F3 colorbar `'N/A = single-class slice; AUPRC undefined'`; F4 subpanel suptitle mapping; F5 ECE/Brier ylabel gloss)
- Required: semantic content matches v1.2.1 baselines (data points; CI bars; rung order; slice ordering)
- Acceptable: cosmetic upstream improvements (palette tones; default tick spacing; legend placement; font sizing)
If an ADR-locked caption-discipline element drops after re-render, the affected figure expands into a focused figure-tuning sub-pass (+15-30 min per affected figure).
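One way to make the "Required" caption check mechanical is to compare the text elements of each re-rendered figure against the locked strings. The sketch below is hypothetical: `missing_caption_elements` is not an existing project function, and how the rendered strings are collected from a figure is left open. It covers only the two strings quoted verbatim in this section.

```python
# Required caption strings quoted in this section; the wording for the other
# figures is not quoted in this ADR, so they are omitted here.
REQUIRED_CAPTION_TEXT = {
    "F2": "LoRA AUPRC minus frozen-probe AUPRC "
          "(95% CI; whiskers crossing 0 = indistinguishable)",
    "F3": "N/A = single-class slice; AUPRC undefined",
}


def missing_caption_elements(rendered_text_by_figure, required=None):
    """Map figure id -> required string that no rendered text element contains.

    rendered_text_by_figure: dict of figure id -> list of strings found on the
    re-rendered figure (labels, titles, colorbar text, annotations).
    """
    required = REQUIRED_CAPTION_TEXT if required is None else required
    missing = {}
    for figure_id, required_text in required.items():
        texts = rendered_text_by_figure.get(figure_id, [])
        if not any(required_text in text for text in texts):
            missing[figure_id] = required_text
    return missing
```

An empty result means no locked string dropped. Any entry in the result identifies a figure that would need the focused figure-tuning sub-pass.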
GH blob fallback note (per ADR-030): the Quarto site is the canonical reading surface; minor stylistic drift on GH blob is acceptable.
§D Decision — no-orphaned-code discipline (refactor commit hygiene)
Per the no-orphaned-code-during-refactor memory invariant: each refactor commit (3-6) DELETES the local hand-roll + any unused imports in the SAME commit as the upstream-primitive call lands. No transition commits with both paths live.
Specific deletions per commit:
- Commit 3 (F1 + F2): delete _compute_pareto_frontier helper; delete unused numpy.argsort import if any; delete hand-rolled AUROC-text positioning
- Commit 4 (F5 + F6): delete hand-rolled imshow + tick-label code; delete bare-matplotlib ax.bar code from F6 left panel
- Commit 5 (mde.py): delete inline closed-form computation; delete fallback note in src/eval/schemas.py:336
- Commit 6 (cross_fold_ci.py): delete inline block-bootstrap helper; delete unused numpy.random import if any
ruff catches unused imports; each commit boundary is additionally verified manually per the Q2 round-6 self-review directive.
§E Decision — ledger updates (Commit 9 close)
decisions/upstream_issues.md updates: 7 existing rows have Status column updated from “filed; awaiting upstream” → “RESOLVED in eval-toolkit v0.42.0; consumed at v1.2.2 per ADR-066”. 1 new row appended for the v1.2.2 Commit 8 stretch contribution to eval-toolkit #36 (issue-comment + design-sketch).
decisions/library_imports.md updates: existing eval-toolkit imports section gains entries for the 7 newly-consumed primitives (plot_roc_curve, plot_pareto_frontier, plot_slice_metric_heatmap, plot_metric_bars with ax=, mde_from_ci generalized, block_bootstrap_on_folds, paired_bootstrap_diff with n_jobs).
§F Consequences
F1 — Library-first invariant honored
The 6 workaround sites that violated the strengthened library-first invariant are paid down. Future contributors writing new eval-toolkit consumer code can dispatch directly into upstream primitives rather than copy-paste-evolve from the deleted local hand-rolls.
F2 — Audit-trail clarity
The carryforward pattern (consume upstream when it lands; delete local in same commit) is now repeated 4 times (ADR-047 / ADR-056 / ADR-058 / ADR-066). Pattern is durable; future v1.X patches that touch local workarounds can reference this ADR as the canonical pattern.
F3 — Methodology preserved
No methodology change. Headline AUPRC ladder + ablation results + threshold-policy findings are unchanged. The refactor only replaces inline plotting + numerical impls with upstream-canonical versions; the data they’re applied to (canonical bootstrap parquets; per-cell parquets) is unchanged.
F4 — Figure caption discipline preserved
ADR-062 + ADR-064 §B4 caption discipline is preserved per the spirit-of-original visual-parity threshold (Q1 round-8 lock). Random-floor annotation, CI-crossing-zero cue, N/A single-class label, subpanel mapping, ECE/Brier gloss — all preserved.
F5 — Cost-trivial
$0 GPU. CPU-only refactor + figure re-render + ADR drafting. Cumulative project compute spend stays at $17.08 (within the ADR-020 $200 hard cap per ADR-065 §E).
F6 — Upstream maintainer = consumer-side decision-maker (pattern continued)
Per memory library-first-pattern-paid-off-twice-v1-0-x: filing concrete upstream issues with proposed API + workaround resolves quickly when upstream maintainer = consumer-side decision-maker. v1.2.2 confirms the pattern a 3rd time (7 issues filed at Phase 4 / v1.0.6 → resolved in ~weeks → consumed at v1.2.2).
Linked ADRs
- References:
  - ADR-047 — Phase 1 `src/data/` carryforward refactor (precedent: 4 hand-rolls retrofitted in single commit)
  - ADR-056 — v1.0.8 binary-calibrator family consumption (precedent: API shape unification after upstream closure)
  - ADR-058 — v1.0.9 `eval_from_hub.py` wiring (precedent: in-place wiring after upstream resolution)
  - ADR-065 — v1.2.1 closing-polish (immediate predecessor)
  - ADR-062 — figure caption discipline that visual-parity must preserve
  - ADR-064 §B4 — figure caption refinements preserved
  - CLAUDE.md — library-first discipline lives in §“Library-first discipline”
- Source: `/exploring-options` rounds 7 + 8 (2026-05-19). Round 7 (6 questions on tier prioritization + ADR structure + pin bump + commit cadence + skip-lines + #36 placement) + Round 8 (4 questions on visual-parity threshold + #36 format + broken-slug-refs approach + ledger format).
Transcript
transcripts/2026-05-19__v1-2-2-library-first-refactor-and-immutability-clarification.md — captures rounds 7 + 8 + the 9-commit execution + per-commit manual-self-review discipline.