Phase 4 analysis implementation bundle — 6-commit cadence + scaffold-with-classical + always-emit-both-CIs auto-flag + MDE-on-every-emitted-CI + LLM-rater audit included (user-overridden from defer) + library-first hybrid figures + Phase 5 prep deferred
Superseded on one or more axes by ADR-062. The body below retains its original prose per the ADR-073 immutability rule; the corrected position lives in the superseding ADR. See the Decisions index to navigate.
ADR-046: Phase 4 analysis implementation bundle
Status
Accepted (2026-05-16). Does not supersede any prior ADR; closes seven implementation-level decisions left open after Phase 0 + Phase 3.
Context
Phase 4 (Analysis) was unblocked at Phase 3 close (commits 8b96946..8272747 pushed to origin/main 2026-05-16; ADR-045 closed). docs/ROADMAP.md:75 confirms Phase 4 entry once Phase 3’s statistical-inference apparatus (paired-bootstrap battery + dual-policy thresholds + calibration battery) shipped.
The Phase 4 entry walkthrough generated 7 numbered /exploring-options questions following Phase 2 (ADR-044) + Phase 3 (ADR-045) precedent. Two notable user interventions during the walkthrough:
- Q5 user override: the initial recommendation was to defer the LLM-rater reference-scorer audit per ROADMAP Phase 4 line 95 `[TBD-at-Phase-4]` cautionary framing. The user explicitly overrode and locked include-now, citing the value of front-loading the audit deliverable rather than waiting for an as-yet-unfired regex-tagger trigger.
- Q6 library-first reaffirmation: the initial recommendation was hand-rolled matplotlib for the 7-figure slate. The user pushed back: `eval-toolkit` ships `plotting.py` with 7 plot helpers plus `save_figure` plus `PALETTE` plus `set_plot_style`, and the project-wide library-first invariant requires consuming these before any local glue. This reframing escalated into a project-wide invariant reaffirmation that triggered a retroactive Phase 1 audit + 4-commit carryforward refactor per ADR-047 (closed at `3615148` immediately before this ADR lands).
Pre-existing surface inherited by Phase 4
Phase 0 + Phase 3 locked the methodology surface for analysis:
- ADR-006: MDE estimated for every reported CI; 3-seed multi-seed floor.
- ADR-018: 4 reference rungs + per-axis matched-budget framing.
- ADR-021: pooled-headline + per-slice-spoke aggregation; recall@FPR pinpoint triad.
- ADR-022: paired_bootstrap_diff per-row for trained-vs-trained; 10K bootstrap @ seed=1 headline + 10K @ seed=2 stability check; bootstrap apparatus.
- ADR-023: full calibration battery (4-ECE matrix + Brier + reliability + temperature + isotonic); landed in Phase 3 Commit 3.
- ADR-024: cv_clt_ci (Bayle 2020) headline + block-bootstrap-on-folds spoke; A-008 sensitivity-check flag when `block_bootstrap_CI_halfwidth / cv_clt_CI_halfwidth > 1.5`.
- ADR-025: dual-policy thresholds; verification-reachability audit per A-009; landed in Phase 3 Commit 4.
- ADR-031: WRITEUP hub-and-spoke 8-spoke structure.
- ADR-033: `v0.9.0-rc1` rehearsal tag (24+ hour dress-rehearsal post-Phase-4-close).
- ADR-038: phase-tailoring lock; Phase 3+4 collapse rejected.
- ADR-045: Phase 3 evaluation implementation bundle (Phase 3 close; full-pairwise paired-bootstrap persistence per Q6 user refinement; 30-cell bootstrap battery shipped to `evals/bootstrap/`).
- ADR-047: Phase 1 library-first carryforward refactor (closed 3615148 immediately before this ADR; Phase 4 proceeds on clean library-first baseline).
What remained open at Phase 4 entry
Seven implementation-level questions the Phase 0 + Phase 3 ADRs did not specify at code-snippet level:
- The commit cadence: mirror Phase 2 + Phase 3’s proven 6-commit pattern, or decompose differently.
- Data-dependence handling: block Phase 4 on operator-gated canonical GPU runs (72 transformer parquets), or scaffold against classical-floor predictions per Phase 3 Q5 precedent.
- Cross-fold CI spoke ablation: always emit both cv_clt_ci + block-bootstrap (rich audit trail) vs conditional emission (compute saver) vs operator-decided per cell.
- MDE scope per ADR-006: apply to every emitted CI (~100 cells) vs headline-only (~20 cells) vs operator-flagged per cell.
- Reference-scorer LLM-rater audit: defer per ROADMAP `[TBD-at-Phase-4]` framing (defer recommendation) vs include now (~$5 LLM-rater pass) vs regex-based per-style tagger.
- Figures slate + renderer: initial recommendation was matplotlib throughout; user-reaffirmed library-first invariant escalated to library-first hybrid with 4 upstream gap issues filed.
- Phase 5 prep interleaving: Phase 4 stays analysis-only vs interleave Quarto + index.qmd scaffolding in Commit 6 vs full Phase 4+5 fusion (rejected per ADR-038).
The walkthrough presented all 7 questions numbered; the user ratified 6 recommendations + explicitly overrode Q5 (include-now instead of defer). This ADR locks all seven; subsequent Phase 4 commits implement them.
Decision
Q1 — 6-commit cadence (Phase 2 + Phase 3 precedent)
Per-commit decomposition:
| Commit | Deliverable | Invariant test landed |
|---|---|---|
| 1 (this) | ADR-046 + SPEC_SHEET §3.8 + audit regen | n/a |
| 2 | src/eval/marginal_bootstrap.py (bootstrap_ci wrappers per ADR-022) + src/eval/cross_fold_ci.py (cv_clt_ci headline per ADR-024) + src/eval/mde.py (mde_from_ci wrappers per ADR-006) + smoke | test_marginal_bootstrap_seed_stability + test_cv_clt_ci_methodology (deferred-unskip at canonical evals run) |
| 3 | src/eval/cross_fold_ci.py extension: always-emit block-bootstrap-on-folds spoke per A-008 + auto-flag column + smoke | test_block_bootstrap_folds_spoke_present + test_a_008_flag_fired_when_ratio_exceeds_1_5 |
| 4 | src/eval/figures.py: library-first hybrid renderer per Q6; 7-figure slate; consumes eval_toolkit.plotting.* for F3 + F4 + F7-subpanels + project glue for F1 + F2 + F5 + F6 + F7-grid + smoke | test_figures_slate_7_svgs_present + test_save_figure_provenance_chunks_present |
| 5 | Orchestration scripts: scripts/{run_marginal_bootstrap, run_cv_clt_ci, run_mde, render_figures, audit_reference_scorers}.py per Q4 + Q5 + Q6 | smoke tests + cost-cap interactive-approval check for audit_reference_scorers |
| 6 | Makefile Phase 4 targets (marginal-bootstrap, cv-clt-ci, mde-battery, render-figures, audit-reference-scorers) + extended make smoke (Phase 4 fixture-pipeline pass under ADR-027 budget) + docs/ROADMAP.md Phase 4 close note + v0.9.0-rc1 rehearsal-tag prep doc | n/a |
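The seed-stability invariant named for Commit 2 can be illustrated with a minimal, standard-library-only sketch. It is not the project's `bootstrap_ci` wrapper: the function names, the use of the mean as the statistic, and the `tol` tolerance are all assumptions chosen for illustration. Only the 10K resample count and the seed=1 / seed=2 pairing come from ADR-022 as summarised above.

```python
import random


def percentile_bootstrap_ci(values, n_boot=10_000, alpha=0.05, seed=1):
    """Percentile bootstrap CI for the mean of `values` (stand-in statistic)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample rows with replacement and record the statistic each time.
    stats = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_boot))
    lo = stats[int(n_boot * alpha / 2)]
    hi = stats[min(n_boot - 1, int(n_boot * (1 - alpha / 2)))]
    return lo, hi


def seeds_agree(values, tol=0.01, **kwargs):
    """Hypothetical stability check: seed=1 and seed=2 endpoints within `tol`."""
    lo1, hi1 = percentile_bootstrap_ci(values, seed=1, **kwargs)
    lo2, hi2 = percentile_bootstrap_ci(values, seed=2, **kwargs)
    return abs(lo1 - lo2) <= tol and abs(hi1 - hi2) <= tol
```

A real invariant test would presumably compare the persisted headline and stability-check intervals rather than recompute them, and the acceptable tolerance would need to be set per metric.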
Q2 — Scaffold-with-classical-floor smoke (ADR-045 Q5 precedent)
All src/eval/ Phase 4 modules consume any predictions parquet matching the PredictionsRowModel schema contract per ADR-045 Q3. Smoke tests use the 12 classical-floor parquets from make eval-classical-floor + the tiny fixture parquets at tests/fixtures/processed/. Transformer-pred-consuming integration invariants (e.g., test_cross_fold_ci_methodology, test_bootstrap_n_and_stability_check) remain @pytest.mark.skip until the 72 transformer parquets exist from operator-gated canonical make headline-{frozen-probe, lora, full-ft} per ADR-020. Phase 5 (WRITEUP) can begin in parallel against classical-floor numbers.
Q3 — Always-emit-both with auto-flag (A-008 + ADR-024)
src/eval/cross_fold_ci.py (Commit 3) emits for every (rung, metric, slice) cell:
- `cv_clt_ci_lo`, `cv_clt_ci_hi`, `cv_clt_ci_halfwidth` per Bayle 2020 Theorem 3.1 on 12 per-(fold, seed) values
- `block_bootstrap_ci_lo`, `block_bootstrap_ci_hi`, `block_bootstrap_ci_halfwidth` per A-008 spoke ablation
- `a_008_flag_fired` boolean: True iff `block_bootstrap_ci_halfwidth / cv_clt_ci_halfwidth > 1.5`
Persisted to evals/audit/cross_fold_ci_audit.parquet. WRITEUP/methodology.md (spoke) text references the LODO non-exchangeability claim only conditionally on the flag firing per A-008. Full audit-trail completeness preserved per ADR-013 persist-everything-report-selectively pattern.
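As a rough illustration of how the three groups of columns relate, the sketch below computes a plain CLT interval over the per-(fold, seed) values, a bootstrap that resamples whole folds, and the 1.5 ratio flag. It is a simplification under stated assumptions: the CLT interval here is the textbook mean ± z·s/√n, not necessarily the estimator from Bayle 2020; the function names and the dict-of-folds input shape are hypothetical; and the 1.5 threshold is the only value taken from A-008.

```python
import random
from statistics import NormalDist, mean, stdev


def clt_ci(values, alpha=0.05):
    """Textbook CLT interval for the mean; returns (lo, hi, halfwidth)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * stdev(values) / len(values) ** 0.5
    m = mean(values)
    return m - half, m + half, half


def block_bootstrap_ci(values_by_fold, n_boot=2000, alpha=0.05, seed=1):
    """Resample whole folds (blocks) with replacement; returns (lo, hi, halfwidth)."""
    rng = random.Random(seed)
    folds = list(values_by_fold)
    means = []
    for _ in range(n_boot):
        drawn = rng.choices(folds, k=len(folds))
        # Each drawn fold contributes all of its per-seed values together.
        means.append(mean(v for f in drawn for v in values_by_fold[f]))
    means.sort()
    lo = means[int(n_boot * alpha / 2)]
    hi = means[min(n_boot - 1, int(n_boot * (1 - alpha / 2)))]
    return lo, hi, (hi - lo) / 2


def a_008_flag(block_halfwidth, clt_halfwidth, threshold=1.5):
    """True iff the block-bootstrap half-width exceeds threshold x the CLT half-width."""
    if clt_halfwidth <= 0:
        raise ValueError("CLT half-width must be positive to form the ratio")
    return block_halfwidth / clt_halfwidth > threshold
```

Because both intervals are always computed, the flag is a pure function of two persisted columns, so a reviewer can recompute it from the audit parquet without rerunning either estimator.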
Q4 — MDE on every emitted CI (ADR-006 mandate)
src/eval/mde.py (Commit 2) provides mde_from_ci_record(ci_lo, ci_hi, n, alpha=0.05, power=0.8) wrapping eval_toolkit.bootstrap.mde_from_ci. Orchestrated by scripts/run_mde.py (Commit 5) sweeping every CI cell across:
- Phase 3 Commit 5 paired-bootstrap cells (~30 cells; `evals/bootstrap/paired_*.parquet`)
- Phase 4 Commit 2 marginal-bootstrap cells (~40 cells; rungs × slices × metrics)
- Phase 4 Commit 3 cross-fold cells (~rungs × slices × metrics; both cv_clt + block-bootstrap)
- Operating-point diff cells (`paired_bootstrap_op_point_diff`)
- ECE delta cells (`paired_bootstrap_ece_diff`)
Total persistence approximately 100 MDE cells in evals/audit/mde_per_cell.parquet. Phase 5 WRITEUP narrative picks reporting subset from the full matrix per ADR-013.
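For readers unfamiliar with deriving an MDE from an interval, one common normal-approximation recipe is sketched below. It is an assumption about the general technique, not a description of what `eval_toolkit.bootstrap.mde_from_ci` or `mde_from_ci_record` actually compute. The function name `mde_normal_approx` is hypothetical, the `n` argument from the wrapper signature is not used here, and the interval is assumed to be a symmetric two-sided CI at level 1 − alpha.

```python
from statistics import NormalDist


def mde_normal_approx(ci_lo, ci_hi, alpha=0.05, power=0.8):
    """Minimum detectable effect implied by a two-sided (1 - alpha) CI.

    Back out the standard error from the half-width, then scale by the
    usual (z_{1-alpha/2} + z_{power}) factor.
    """
    if ci_hi < ci_lo:
        raise ValueError("ci_hi must be >= ci_lo")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    standard_error = (ci_hi - ci_lo) / (2 * z_alpha)
    return (z_alpha + z_power) * standard_error
```

Under this approximation the MDE at alpha=0.05 and power=0.8 is roughly 1.43 times the CI half-width, which is why the sweep over ~100 cells is cheap: each cell is a closed-form calculation on two persisted numbers.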
Q5 — Reference-scorer LLM-rater audit INCLUDED (user override)
User overrode the original defer recommendation. scripts/audit_reference_scorers.py (Commit 5):
- Samples ~50 prediction-pairs per reference rung (R-LLM-OpenAI + R-LLM-Anthropic + R-ProtectAI-v1 + R-ProtectAI-v2 per ADR-018) where the reference scorer disagrees with the trained-rung classifier (lora or full_ft headline).
- Rubric grades each pair as `(rater_judgment_correct_about_injection, calibration_assessment)`.
- Cost-cap-gated with interactive approval per ADR-020 + ADR-045 Q4 (estimated ~$5 per A-002 envelope across all 4 reference rungs).
- Results persisted to `evals/audit/reference_scorer_rater_audit.json` + a methodology spoke section in `WRITEUP/reference-scorer-audit.md`.
User rationale: front-loading the audit deliverable is more valuable than waiting for a regex-tagger-conservative-enough trigger that may never fire.
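The sampling and approval steps above could look roughly like the following sketch. Everything in it is illustrative: the row keys (`rung`, `reference_pred`, `trained_pred`), both function names, and the prompt wording are hypothetical, and the real script's cost-cap mechanism under ADR-020 may differ. Only the ~50-per-rung sample size and the disagreement filter come from the description above.

```python
import random


def sample_disagreements(rows, per_rung=50, seed=1):
    """Group disagreeing rows by reference rung and sample up to `per_rung` each."""
    pools = {}
    for row in rows:
        # Keep only pairs where the reference scorer and trained rung disagree.
        if row["reference_pred"] != row["trained_pred"]:
            pools.setdefault(row["rung"], []).append(row)
    rng = random.Random(seed)
    return {
        rung: rng.sample(pool, min(per_rung, len(pool)))
        for rung, pool in sorted(pools.items())
    }


def approve_spend(estimated_usd, cap_usd, ask=input):
    """Refuse above the cap; otherwise require an explicit operator 'yes'."""
    if estimated_usd > cap_usd:
        return False
    answer = ask(
        f"Estimated cost ${estimated_usd:.2f} (cap ${cap_usd:.2f}). Proceed? [y/N] "
    )
    return answer.strip().lower() in {"y", "yes"}
```

Passing `ask` as a parameter keeps the approval gate testable in a smoke run without a live terminal, while defaulting to a "no" on empty input.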
Q6 — Library-first hybrid figures (revised after walkthrough audit)
src/eval/figures.py (Commit 4) ships the 7-figure slate as a library-first hybrid:
| Figure | Render path |
|---|---|
| F1 (Pareto AUPRC × compute) | project glue + set_plot_style + PALETTE + save_figure (gap filed as issue #15) |
| F2 (ROC per rung) | project glue + set_plot_style + PALETTE + save_figure (gap filed as issue #14; PR candidate per task #4) |
| F3 (PR per rung) | eval_toolkit.plotting.plot_pr_curve directly |
| F4 (reliability diagrams: raw + temperature + isotonic) | eval_toolkit.plotting.plot_reliability_diagram × 3 panels via subplots |
| F5 (per-slice OOD heatmap) | project glue + set_plot_style + save_figure (gap filed as issue #16) |
| F6 (LODO fold variance breakdown) | project glue + plot_metric_bars + plot_lift_ci + save_figure |
| F7 (dual-policy operating-point grid with reachability flags) | project glue grid layout + plot_bootstrap_distribution sub-panels + save_figure |
All output as SVG to docs/plots/ per ADR-030 Quarto site embedding. set_plot_style + PALETTE (negative/positive/baseline/accent) applied throughout for consistent styling.
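For F1, the project glue has to decide which rungs sit on the Pareto frontier before anything is drawn. A minimal sketch of that selection step follows, assuming each rung is summarised as a `(label, compute_cost, auprc)` tuple where lower cost and higher AUPRC are better. The function name and tuple layout are hypothetical, and plotting itself is deliberately left out because the `eval_toolkit.plotting` signatures are not reproduced here.

```python
def pareto_frontier(points):
    """Return labels of non-dominated points, in order of increasing compute.

    `points` is an iterable of (label, compute_cost, auprc). A point is kept
    only if it has strictly higher AUPRC than every cheaper-or-equal point.
    """
    # Sort by cost ascending; at equal cost, put the higher AUPRC first.
    ordered = sorted(points, key=lambda p: (p[1], -p[2]))
    frontier = []
    best_auprc = float("-inf")
    for label, _cost, auprc in ordered:
        if auprc > best_auprc:
            frontier.append(label)
            best_auprc = auprc
    return frontier
```

Exact duplicates collapse to the first occurrence under this rule; whether ties should both be drawn is a presentation choice the sketch does not settle.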
Project-wide library-first invariant codified at walkthrough Q6 (memory entry library-first-is-project-wide-invariant 2026-05-16) — audit eval-toolkit + runpod-deploy + research_toolkit at every module-design step.
Q7 — Phase 5 prep deferred
Phase 4 stays analysis-only (single-concern phase per ADR-038 phase-tailoring lock). Phase 5 begins post-Phase-4-close with WRITEUP authoring + Quarto site infrastructure + model card scaffold. v0.9.0-rc1 rehearsal tag fires after Phase 4 close per ADR-033 triggering the full publish pipeline as a 24+ hour dress-rehearsal; fix-forward via new commits + v0.9.0-rc2 if rehearsal fails.
Consequences
Positive:
- Phase 4 ships on clean library-first baseline (ADR-047 carryforward refactor closed immediately before this ADR).
- Q3 always-emit-both produces a richer audit trail than conditional emission; reviewer can verify the A-008 sensitivity-check methodology independently of whether the flag fires.
- Q4 MDE-on-every-CI satisfies ADR-006 mandate explicitly; Phase 5 WRITEUP narrative has full matrix to draw from without re-running.
- Q5 user override produces an additional reference-scorer audit deliverable that strengthens the methodology spoke; ~$5 cost is well within A-002 budget envelope.
- Q6 library-first hybrid honors the project-wide invariant + files 4 upstream gaps that benefit other eval-toolkit consumers.
- Q7 phase-gate discipline preserves clean separation between Phase 4 analysis and Phase 5 writeup.
Negative / cost:
- Q4 MDE on every CI adds compute (~100 cells × ~0.01s per cell ≈ 1s total; negligible).
- Q5 LLM-rater audit adds ~$5 cost + interactive operator approval gate.
- Q6 library-first hybrid + 4 upstream gaps creates short-term coupling to upstream issue triage cadence (mitigated by project-glue fallbacks for all 4 gaps).
- The 6-commit cadence adds operator-driven commit overhead vs the collapsed 4-commit option; chosen because the Phase 2 + Phase 3 6-commit precedent worked.
Neutral:
- Q1 cadence + Q2 scaffold + Q7 Phase 5 defer all match prior phase precedent; no novel discipline.
- ADR-045 Q6 full-pairwise bootstrap persistence carries forward (Phase 4 reads `evals/bootstrap/` rather than re-running).
- ADR-038 phase-tailoring lock honored: no Phase 3+4 collapse, no Phase 4.5 split.
Alternatives Considered
- Q1 (4-commit collapsed cadence): combine bootstrap primitives + figures + scripts. Rejected because: harder to revert atomically per Phase 2/3 lessons; user ratified 6-commit.
- Q2 (block on canonical runs): wait for `make headline-*` to complete before starting Phase 4. Rejected because: indefinite operator-availability stall; loses parallel-track shipping discipline that worked in Phase 3.
- Q3 (conditional emission): only emit block-bootstrap when cv_clt_CI half-width exceeds threshold. Rejected because: loses audit-trail completeness; A-008 flag becomes data-dependent in a way that complicates the methodology spoke.
- Q4 (headline-only MDE): compute MDE only on WRITEUP headline table cells (~20). Rejected because: misses ADR-006 contract on “every reported CI”; would need ADR amendment.
- Q5 (defer LLM-rater audit, the original recommendation): ship `scripts/audit_reference_scorers.py` skeleton only; re-evaluate post-Phase-4. User-overridden: user explicitly chose include-now citing value of front-loading the audit.
- Q6 (full matplotlib throughout, the initial recommendation): hand-roll all 7 figures. Rejected after walkthrough Q6 user-reaffirmation: violates project-wide library-first invariant. Revised to library-first hybrid; 4 upstream gaps filed before any glue ships.
- Q7 (interleave Phase 5 prep in Commit 6): add Quarto + index.qmd scaffolding to Phase 4 close. Rejected because: muddies phase-gate; ADR-038 phase-tailoring lock applies; rehearsal tag naturally fires after Phase 4 close per ADR-033.
- Q7 (full Phase 4+5 fusion): collapse Phase 4 + Phase 5 into one extended cycle. Rejected per ADR-038 phase-tailoring lock: “Phase 3+4 collapse is rejected since Phase 4 carries first-class statistical-inference work that deserves its own phase-gate discipline”.
References
- `decisions/ADR-006-headline-metrics-and-statistical-apparatus.md`: MDE-on-every-CI source
- `decisions/ADR-018-reference-scorer-slate-and-contamination-stratification.md`: 4 reference rungs source
- `decisions/ADR-020-compute-infrastructure-and-cost-discipline.md`: Q5 cost-cap-gated interactive approval source
- `decisions/ADR-021-eval-slate-aggregation-and-recall-fpr-pinpoints.md`: 5-slice OOD slate source
- `decisions/ADR-022-statistical-inference-apparatus.md`: bootstrap apparatus source
- `decisions/ADR-023-calibration-battery-and-interventions.md`: F4 reliability source
- `decisions/ADR-024-cross-fold-ci-methodology.md`: Q3 cv_clt_ci + A-008 source
- `decisions/ADR-025-dual-policy-threshold-characterization.md`: F7 dual-policy source
- `decisions/ADR-031-reviewer-reading-paths-quarto-site-entry.md`: WRITEUP spoke structure
- `decisions/ADR-033-github-release-strategy-rehearsal-plus-submission.md`: v0.9.0-rc1 rehearsal tag source
- `decisions/ADR-038-phase-tailoring-light-roadmap-edits.md`: Q7 phase-tailoring lock
- `decisions/ADR-045-phase-3-evaluation-implementation-bundle.md`: Phase 3 precedent + Q6 full-pairwise persistence
- `decisions/ADR-047-phase-1-library-first-carryforward-refactor.md`: Phase 1 carryforward refactor (closed immediately before this ADR)
- `decisions/library_imports.md`: library-first discipline ledger (expanded post-Phase-4 commits)
- `decisions/upstream_issues.md`: upstream gaps ledger (entries #14-19)
- https://github.com/brandon-behring/eval-toolkit/issues/14 : `plot_roc_curve` PR candidate (F2)
- https://github.com/brandon-behring/eval-toolkit/issues/15 : `plot_pareto_frontier` (F1)
- https://github.com/brandon-behring/eval-toolkit/issues/16 : `plot_slice_metric_heatmap` (F5)
- https://github.com/brandon-behring/eval-toolkit/issues/17 : `paired_bootstrap_diff n_jobs` kwarg
Transcript
See transcripts/2026-05-16__phase-4-entry-plus-phase-1-library-first-refactor.md for the conversation that led to this decision (Phase 4 walkthrough Q1-Q7 + Q5 user override + Q6 library-first reaffirmation + ADR-047 carryforward refactor; saved 2026-05-16 mid-session per the save-transcripts-at-major-milestones discipline).