Statistical inference apparatus — bootstrap N + stability check, multi-comparison stance, multi-seed protocol, paired-test method

Published

May 15, 2026

ADR-022: Statistical inference apparatus — bootstrap N + stability check, multi-comparison stance, multi-seed protocol, paired-test method

Status

Accepted (2026-05-15). Formalizes ADR-006 brief-level pre-locks at the §3 Eval ledger-row level + adds row-specific operational details.

Context

ADR-006 (Phase 0-00 brief alignment) pre-locked the brief-level inferential stance: 10K bootstrap iterations, BCa marginals plus paired Delta-CIs, estimation-over-testing, 3 seeds paired-across-rungs, no formal p-tests. The §3 Eval ledger rows 340 / 341 / 344 / 345 require formalization at the row level with the operational details ADR-006 left implicit:

  • Row 340 (bootstrap N): the SPEC-text “pinned seed with stability check at a second seed” was not operationalized
  • Row 341 (multi-comparison correction): ADR-006 stated “not applicable” but did not specify whether to surface the acknowledgment in the writeup
  • Row 344 (multi-seed protocol): the “paired-across-rungs” semantics — what is paired and how heterogeneous rung observation counts (12 trained vs 4 reference) reconcile — were ambiguous
  • Row 345 (paired-test method): DeLong / McNemar rejection rationale was stated at the brief level but lacked the LODO-specific operational reasoning

Phase 0-04 walks Q2 / Q3 / Q6 / Q7 surfaced 6 gaps in the original pre-lock — reference-rung observation-count asymmetry; aggregation order for rank-based vs per-row metrics; (fold, seed) blocking violating iid; LLM-judge non-determinism at T=0; recall@FPR threshold drift across seeds; per-(rung, fold, seed) calibrator fitting. This ADR resolves each gap with a methodology-honest default.

Decision

Bootstrap N plus stability check (row 340)

Layer Decision
Headline iterations 10K via eval_toolkit.bootstrap_ci (BCa for marginals); 10K via eval_toolkit.paired_bootstrap_diff (percentile for paired Delta-CIs)
Stability check 10K at a second resampling seed (seed=2); flag when stability-check half-width differs from headline half-width by more than 5 percent
Parallelization joblib.Parallel(n_jobs=-1) at the orchestrator layer in scripts/run_bootstrap_battery.py; library primitive itself stays single-threaded as shipped (library-first discipline); 64-core Threadripper expected to compress approximately 10000 independent CIs from hours of serial walltime to minutes
Upstream issue Filed against eval-toolkit proposing optional n_jobs parameter on paired_bootstrap_diff for internal resample-loop parallelization

Multi-comparison correction (row 341)

No formal correction applied per ADR-006 estimation-over-testing stance. The methodology spoke at WRITEUP/methodology.md gains an explicit “Family of comparisons” acknowledgment paragraph that:

  • Names the comparison family size (~28 paired rung-vs-rung comparisons across 8 rungs choose 2)
  • Cites Gelman and Loken 2014 “garden of forking paths” framing
  • Cites the ASA 2016 statement on p-values supporting effect-size-with-CI over null-hypothesis testing
  • States explicitly that paired-bootstrap Delta-CIs are not Bonferroni-corrected and explains the rationale (correction applies to significance-testing; we report effect sizes)

Multi-seed protocol details (row 344)

3 seeds {42, 1337, 2025} per ADR-006 ratified plus the following gap-honest implementation defaults:

Sub-decision Locked policy
Trained-rung observations per rung 12 = 4 LODO folds × 3 seeds per ADR-016
Reference-rung observations per rung 4 = 4 LODO folds × 1 (no seed dimension; inference-only at T=0)
Trained-vs-trained pairing Row-level pairing via eval-toolkit paired_bootstrap_diff (same y_true rows across both rungs)
Trained-vs-reference pairing Per-row replication of reference scores across the 12 trained seeds; reference-side variance is fold-only (correctly so — reference has no seed dimension)
Rank-based metric aggregation (PR-AUC, ROC-AUC, recall@FPR) Compute per-(fold, seed) → 12 values per rung → mean → cross-fold CI per ADR-024; pool-then-compute reported in spoke as sensitivity-check ablation
Per-row metric aggregation (ECE, Brier) Pool rows across (fold, seed) and compute once per rung (within-block iid violation is small for per-row scoring rules)
Recall@FPR threshold computation Per-(seed) thresholds computed from val per-(rung, fold); applied to test; 12 thresholds per (rung, fold) yielding 12 recall values per rung; averaged
Calibration intervention fit Per-(rung, fold, seed) → 12 calibrators per rung × 2 interventions (temperature + isotonic) per ADR-023
Per-seed transparency Per-(rung, fold, seed) values dumped to evals/audit/per_seed_observations.parquet per ADR-011 Guarantee 5
Variance attribution spoke Per-rung decomposition: fold-to-fold vs seed-to-seed vs within-(fold, seed) bootstrap noise
LLM-judge non-determinism Reference scores cached at first call; re-run only on cache miss; inter-call variance not measured (A-007)

Paired-test method (row 345)

eval-toolkit paired_bootstrap_diff (Efron-Tibshirani 1993 §10.3 row-level pairing) ratified. Rejection rationale for the field-standard alternatives at the row level:

Method Why rejected (multi-source-LODO-specific)
DeLong 1988 Asymptotic Gaussian assumption breaks at per-fold n ≈ 4-5K benigns; designed for AUROC only — doesn’t extend to AUPRC or ECE; produces p-value contradicting estimation-over-testing; LODO fold-blocking violates the iid assumption underlying the DeLong variance estimate
McNemar Threshold-dependent; would require committing to a single deployment operating point we explicitly refuse to commit to (ADR-006 + scenario-based framing); doesn’t handle the dual-policy framing
Cochran’s Q / Friedman Designed for fixed-classifier-vs-many-datasets — inverse of our setting (many rungs vs few LODO folds); produces p-value
Permutation test on Delta Equivalent to bootstrap on the null hypothesis; would require explicit p-value reporting; contradicts ADR-006 stance; more compute

Consequences

Positive

  • All four rows formalized at the ledger-row level with operational details ADR-006 left implicit; A2 reviewer can read exact apparatus without inferring from brief-level claims
  • Parallelization-via-glue plan preserves library-first discipline while leveraging 64-core hardware — upstream issue captures the future improvement path
  • Gap-honest multi-seed defaults eliminate the silent-aggregation-choice failure mode (the choice of “pool-rows-and-compute-once” vs “compute-per-(fold,seed)-and-aggregate” is itself a methodology decision; both are now explicit)
  • Multi-comparison acknowledgment paragraph pre-empts A2 reviewer concern; cheap (~1 paragraph)
  • DeLong / McNemar rejection rationale at the row level satisfies ADR-011 Guarantee 8 (no untracked methodology components)

Negative

  • Per-(rung, fold, seed) audit JSON pipeline + per-seed observations parquet add Phase 3 deliverables
  • Variance-attribution spoke subsection requires Phase 4 analysis work (per-rung ANOVA-style decomposition)
  • LLM-judge response caching infrastructure (Phase 1 deliverable) adds a small persistence concern

Neutral

  • Rank-based-metric-aggregation choice (compute-per-(fold, seed)-then-mean) pushes weight onto ADR-024’s cross-fold CI machinery; the two ADRs are intentionally coupled

Alternatives considered

For each row’s alternatives, see ADR-006’s Alternatives Considered section. The ADR-022-specific additions:

  • No stability check (Option A of Q2): rejected because cannot distinguish narrow-CI from narrow-but-lucky-CI; cheap to add second-seed run
  • 100K-iteration headline plus 10K second-seed ablation (Option C of Q2): rejected because 100K walltime ~10x of 10K at scale of ~10000 independent CIs; marginal gain over 10K+10K is small per Efron-Tibshirani 1993 iteration-count guidance
  • Silent multi-comparison treatment (Option A of Q3): rejected because A2 reviewer may infer cherry-picking from many-comparisons surface
  • BH-FDR post-hoc on the comparison family as ablation (Option C of Q3): rejected because adds inferential apparatus contradicting estimation-over-testing stance; muddles narrative
  • Pre-registration of primary-vs-exploratory comparisons (Option D of Q3): rejected because adds bureaucracy; some comparisons are genuinely planned-vs-incidental and the line is hard to draw cleanly
  • Seed-only pairing across folds (alternative to (fold, seed)-tuple pairing in Q6): rejected because loses fold-level pairing
  • Trained-vs-reference per-rung-seed-aggregation (Option (b) of Q6 Gap 1): rejected because drops trained-side seed variance from the Delta-CI making it tighter than honest

Phase 1 deliverables

  • scripts/run_bootstrap_battery.py with joblib parallelization scaffolding
  • evals/audit/per_seed_observations.parquet schema documented in evals/audit/README.md
  • evals/audit/llm_judge_cache/ directory layout documented (per-row response cache for gpt-4o + claude-sonnet)
  • decisions/upstream_issues.md gains entry for paired_bootstrap_diff parallelization proposal

References

See frontmatter references list. Primary anchors — Efron 1979 plus Efron-Tibshirani 1993 §10.3 (bootstrap apparatus); DiCiccio-Efron 1996 (BCa derivation); Gelman-Loken 2014 (forking paths); ASA 2016 statement on p-values; Gelman-Stern 2006 (significance vs not-significance distinction); DeLong 1988 (the test we are not using); Bouckaert 2003 (multi-seed paired-across-rungs precedent); Demsar 2006 (per-dataset reporting in multi-dataset settings); ADR-006 (brief-level pre-lock); ADR-011 Guarantee 5 (no cherry-picking seeds); ADR-016 (LODO + 12-obs-per-rung); ADR-018 (reference rung slate + contamination taxonomy).

Transcript

See transcripts/2026-05-15__phase-0-04__eval-framework.md for the conversation that led to this decision.