Upstream issues filed during this iteration

How to read this page. This is the evidence ledger behind the anti-hand-rolling rule. It is dense by design: each row ties a discovered upstream gap to the project code or documentation that depended on it.

Every discovered library gap, bug, or feature request is filed to the relevant upstream GitHub repo before any local workaround is written. The Local site column records where the dependency or workaround lives in this repo.

Filed issues go to:

  • brandon-behring/eval-toolkit — evaluation primitives
  • brandon-behring/runpod-deploy — cloud orchestration
  • brandon-behring/research_toolkit — dossier production

Triage label on filed issues: tracked.

Current interpretation

  • Many early eval-toolkit gaps are now resolved and consumed; stale local workaround annotations should not be read as active debt unless the status column still says filed or contribution pending.
  • The v1.2.8 dependency bump records eval-toolkit #48, #49, and #53 as library-first maintenance only. It does not add a new methodology claim or change the reported results.
  • Demo-strategy ideas such as activation-delta analysis stay outside this patch unless a future ADR scopes them into the project.
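
The reading rule in the first bullet can be sketched as a small status classifier. This is a minimal illustration, not tooling that exists in this repo: the function name `is_active_debt` and its string-matching heuristics are assumptions. Only the rule itself (a row is active debt only while its status still says filed or contribution pending) comes from the text above.

```python
def is_active_debt(status: str) -> bool:
    """Return True if a ledger Status cell still describes open upstream work.

    Hypothetical helper: the matching rules below are assumptions made for
    illustration, not a parser shipped with this project.
    """
    text = status.strip().lower()
    # A row that opens with RESOLVED is closed, even if it later mentions
    # when the issue was originally filed.
    if text.startswith("resolved"):
        return False
    return text.startswith("filed") or "contribution pending" in text


# Sample Status cells abbreviated from rows in this ledger.
statuses = [
    "RESOLVED in eval-toolkit v0.42.0; consumed at v1.0.x",
    "filed at Phase 1 library-first audit (per ADR-047); docs PR candidate",
    "RESOLVED in eval-toolkit v1.0.3; consumed at v1.3.8. Filed at v1.3.3",
]
flags = [is_active_debt(s) for s in statuses]
print(flags)
```

The third sample shows why the check on the leading RESOLVED comes first: a resolved row may still mention when it was filed.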

Ledger scope (clarified at v1.2.13; updated at v1.2.14)

This ledger tracks issues filed as a consequence of this project’s library-first audits (Phase 1 carryforward refactor, Phase 4 figures, v1.0.x post-submission patches, v1.1.x DeBERTa ablation, v1.2.x library maintenance). Upstream issues filed independently of this project’s audit path are recorded here when a dependency bump references them (dependency/ledger maintenance per ADR-066 trigger #4), or noted as out-of-scope when no consumption is planned.

  • eval-toolkit #50 (feat(losses): RecallAtLowFPR — Meta Prompt Guard 2 recipe; CLOSED 2026-05-19; shipped in eval-toolkit v0.44.0; recorded/consumed as dependency maintenance at v1.2.14): loss-recipe addition. No methodology dependency for this project’s evaluation slate at v1.2.14; the pin bump records the upstream resolution without changing this project’s reported metrics.
  • eval-toolkit #51 (feat(preprocessing): spotlighting; CLOSED 2026-05-21; shipped in eval-toolkit v0.44.0; recorded/consumed as dependency maintenance at v1.2.14): defense-side preprocessing pattern (delimit/datamark/encode + sweep). Out of v1.0.0/v1.2.x evaluation slate scope per ADR-039 + the portfolio-repo plan ([[portfolio_plan_approved]] memory); recorded here because the v0.44.0 bump references its closure.
  • eval-toolkit #52 (feat(stacking): MetaLearner Protocol + LogisticStacker; CLOSED 2026-05-21; shipped in eval-toolkit v0.45.0; recorded/consumed as dependency maintenance at v1.2.16): stacking primitive; out of v1.0.0/v1.2.x evaluation slate scope per ADR-007 (single-rung detectors only at submission tier); recorded here because the v0.45.0 leg of the v0.44→v0.47 bump at v1.2.16 references its closure.

If any of these later become consumption candidates (e.g., a v1.3.x methodology decision opens preprocessing or stacking), a new ledger row will be added at consumption time with the full local file:line + status.

| Date | Repo | Issue # | Title | Local site (file:line) | Status |
| --- | --- | --- | --- | --- | --- |
| 2026-05-15 | brandon-behring/eval-toolkit | #17 | Add optional n_jobs parameter to paired_bootstrap_diff for internal resample-loop parallelization | scripts/run_bootstrap_battery.py (Phase 4 deliverable per ADR-022) | RESOLVED in eval-toolkit v0.42.0; ledger status updated at v1.2.2 per ADR-066. et.paired_bootstrap_diff(..., n_jobs=N) available. Project keeps n_jobs=1 default per #41 worker-copy memory model (shared-state pattern is opt-in for callers). |
| 2026-05-16 | brandon-behring/eval-toolkit | #14 | Add plot_roc_curve (sibling to plot_pr_curve) | src/eval/figures.py (Phase 4 F2 deliverable per ADR-046) | RESOLVED in eval-toolkit v0.42.0; consumed at v1.0.x per ADR-066 §B2. et.plot_roc_curve dispatched in src/eval/figures.py::render_f2_roc_per_rung; stale annotation removed at v1.2.2 Commits 3-6 consolidated. |
| 2026-05-16 | brandon-behring/eval-toolkit | #15 | Add plot_pareto_frontier for cost-vs-performance scatter with frontier overlay | src/eval/figures.py (Phase 4 F1 deliverable per ADR-046) | RESOLVED in eval-toolkit v0.42.0; consumed at v1.0.x per ADR-066 §B1. et.plot_pareto_frontier dispatched in src/eval/figures.py::render_f1_pareto; stale annotation removed at v1.2.2 Commits 3-6 consolidated. |
| 2026-05-16 | brandon-behring/eval-toolkit | #16 | Add plot_slice_metric_heatmap for (group_x × group_y × metric) grids | src/eval/figures.py (Phase 4 F5 deliverable per ADR-046) | RESOLVED in eval-toolkit v0.42.0; consumed at v1.0.x per ADR-066 §B3. et.plot_slice_metric_heatmap dispatched in src/eval/figures.py::render_f5_slice_heatmap; stale annotation removed at v1.2.2 Commits 3-6 consolidated. |
| 2026-05-16 | brandon-behring/eval-toolkit | #18 | Wire dedup-holdout golden test against this project’s 50-pair adversarial dataset | data/dedup_holdout.jsonl + scripts/calibrate_dedup.py + evals/dedup_calibration.json | filed at Phase 1 library-first audit (per ADR-047); golden-test contribution candidate |
| 2026-05-16 | brandon-behring/eval-toolkit | #19 | Add cookbook docs: 3 compositional patterns (nested-seed splits + callable-embedder strategy + pairs_across contamination scan) | src/data/{splits, dedup, audit}.py (Phase 1 hand-rolls that motivated this) | filed at Phase 1 library-first audit (per ADR-047); docs PR candidate |
| 2026-05-16 | brandon-behring/eval-toolkit | #20 | Generalize mde_from_ci to accept BootstrapCI \| PairedBootstrapCI for marginal-MDE use case | scripts/run_mde.py (Phase 4 deliverable per ADR-046 Q4) | RESOLVED in eval-toolkit v0.42.0; consumed at v1.0.x per ADR-066 §B5. et.mde_from_ci generalized; src/eval/mde.py::mde_from_marginal_ci_record uses the upstream API directly. Stale schemas.py:336 annotation removed at v1.2.2 Commits 3-6 consolidated. |
| 2026-05-16 | brandon-behring/eval-toolkit | #21 | Add block_bootstrap_on_folds (CV-aware block bootstrap; complement to cv_clt_ci) | src/eval/cross_fold_ci.py (Phase 4 deliverable per ADR-046 Q3 + ADR-024 + A-008) | RESOLVED in eval-toolkit v0.42.0; consumed at v1.0.x per ADR-066 §B6. et.block_bootstrap_on_folds shipped; src/eval/cross_fold_ci.py::compute_block_bootstrap_on_folds thin-wraps. Stale cross_fold_ci.py:202 annotation removed at v1.2.2 Commits 3-6 consolidated. |
| 2026-05-17 | brandon-behring/eval-toolkit | #22 | Add ax: Axes \| None kwarg to plot_metric_bars (parity with plot_pr_curve / plot_reliability_diagram / plot_lift_ci / plot_bootstrap_distribution) | src/eval/figures.py::render_f6_lodo_breakdown (Phase 4 Commit 4 per ADR-046 Q6) | RESOLVED in eval-toolkit v0.42.0; consumed at v1.2.2 per ADR-066 §B4. et.plot_metric_bars(ax=...) shipped; bare-matplotlib bars in src/eval/figures.py::render_f6_lodo_breakdown left panel replaced with library-first dispatch in v1.2.2 Commits 3-6 consolidated (the only ACTUAL refactor site in the v1.2.2 series; all others were already-consuming + stale-comment cleanup). |
| 2026-05-17 | brandon-behring/runpod-deploy | #88 | Make SSH-ready timeout configurable (default 240s too aggressive for cold image pulls) | scripts/runpod_deploy_long_ssh.py (monkey-patch shim with 600s deadline); Makefile headline-{frozen-probe,lora,full-ft} use shim for run --config | RESOLVED in runpod-deploy v0.8.2 (configurable budget.ssh_ready_timeout_sec; default 900s). Consumed at v1.1.0 per ADR-059: shim DELETED in same commit as pin bump per no-orphaned-code invariant; budget.ssh_ready_timeout_sec: 600 explicit override added to all 3 headline-*.yaml configs (matches the shim’s 600s value; preserves effective behavior). |
| 2026-05-17 | brandon-behring/runpod-deploy | #92 (doc) + #93 (PR) + #94 (scan-consumer) | UV_LINK_MODE=copy + UV_CACHE_DIR + UV_PROJECT_ENVIRONMENT off /workspace required on FUSE-mounted /workspace (silent hang under default hardlink mode + git-resolution F_SETLKW hang + HF Trainer atomic-save F_SETLKW hang) | configs/runpod/headline-{frozen_probe,lora,full_ft}.yaml remote_env.exports (Phase 4 X7 + X8 commits 13f06b8/8af73ae); memory entry memory/fuse-workspace-needs-uv-link-mode-copy.md (3 FUSE failure modes); seed manifest + cost ledger (artifacts/runpod/manual-20260517T155912Z/runpod_deploy_pull_manifest.json + evals/cost_ledger.csv commit 7c1f36c) | RESOLVED: PR #93 merged + #92/#94 docs/scan-consumer landed in runpod-deploy v0.8.x. v1.1.0 consumes upstream resolutions via the pin bump (ADR-059); existing remote_env.exports UV_LINK_MODE=copy + UV_CACHE_DIR + UV_PROJECT_ENVIRONMENT entries retained as belt-and-suspenders (no upstream code requires removal). External verification: uv#17801, MooseFS#380, Linux kernel patch 2025-12-23. |
| 2026-05-17 | brandon-behring/runpod-deploy | #90 (feat) | lifecycle.on_success: recycle - resume paused pods on run retry (avoid re-provisioning when only config edits changed) | bypassed orchestrator via manual SSH for i5p43hfq88zpqt recovery (Phase 4 re-fire #5); no local code dependency (waiting on upstream feature) | RESOLVED; lifecycle.on_success: recycle semantics landed in runpod-deploy v0.8.x. Consumed at v1.1.0 Commit 3 of 3 for the DeBERTa-v3-base medium ablation via new configs/runpod/headline-deberta.yaml (per /exploring-options 2026-05-19 Q2 lock: sequential single pod, 2 training fires sharing 1 pod via recycle; saves ~$1-2 + ~3-5 min vs full teardown between truncation strategies). The 3 existing headline configs use on_success: delete per ADR-013 pre-teardown checklist; recycle is opt-in. |
| 2026-05-17 | brandon-behring/runpod-deploy | #97 (feat) | validate --all should HTTP HEAD-check pod.image against the public registry (catch phantom tags before pod fires) | no local code dependency (validate-time check; consumer-side fix was image-refresh in configs/runpod/headline-*.yaml Phase 4 X5 commit b629181) | RESOLVED: image-registry HEAD-check landed in v0.8.x; invoked by default within validate --all. Consumed automatically at v1.1.0 (no project-side wiring needed); make headline-dry-run now reports [image-registry] ok: 'runpod/pytorch:...' exists on Docker Hub line for each of the 3 migrated configs. Catches phantom Docker Hub tags before billing. |
| 2026-05-17 | brandon-behring/runpod-deploy | #98 (docs) | Makefile-recipe pattern doc when runpod-deploy is in [project.optional-dependencies] dev (consumers hit cryptic PATH error otherwise) | Makefile recipes use uv run runpod-deploy ... (Phase 4 X4 sed-rewrite commit a04e016) | RESOLVED: Makefile-recipe pattern documented in runpod-deploy v0.8.x docs. Consumed at v1.1.0 via continued use of the uv run runpod-deploy ... pattern in the Makefile (now also covers the 3 callers previously delegating to the deleted scripts/runpod_deploy_long_ssh.py shim; they now use uv run runpod-deploy run ... directly per #88 closure). |
| 2026-05-18 | brandon-behring/eval-toolkit | #39 | Add is_metric_defined_for_slice(slice_name, metric_name) -> bool primitive (or equivalent eval_toolkit.metrics.skip_single_class_slices filter) so consumers don’t hand-roll the “AUROC/AUPRC undefined on single-class slices” rule | src/eval/slice_analysis.py::is_metric_defined_for_slice (v1.0.0 Item 4 single-class filter) + src/eval/marginal_bootstrap.py::compute_marginal_battery source-level filter + scripts/run_cv_clt_ci.py orchestrator filter | RESOLVED in eval-toolkit v0.39.0 (closed 2026-05-18 20:13 UTC). Consumed at v1.0.6: src/eval/slice_analysis.py::is_metric_defined_for_slice now thin-wraps upstream eval_toolkit.is_metric_defined_for_slice(metric_name, is_single_class=slice_name in SINGLE_CLASS_SLICES); local SINGLE_CLASS_INCOMPATIBLE_METRICS constant removed (uses upstream eval_toolkit.SINGLE_CLASS_INCOMPATIBLE_METRICS). Project-specific knowledge of which slice NAMES are single-class stays local (SINGLE_CLASS_SLICES frozen-set kept). |
| 2026-05-18 | brandon-behring/eval-toolkit | #40 | Relax LeakageCheck Protocol name member from settable-attr to read-only @property (or Final[str]) so @dataclass(frozen=True) LeakageCheck implementations like CrossSplitLeakageCheck are mypy-strict-compatible | src/data/audit.py:222 cast(LeakageCheck, leakage_check) workaround at the call site | RESOLVED in eval-toolkit v0.39.0 (closed 2026-05-18 20:13 UTC). Consumed at v1.0.6: cast(LeakageCheck, leakage_check) workaround removed at src/data/audit.py:222; unused LeakageCheck import removed from line 34. Confirmed LeakageCheck.name is now an @property via runtime inspect check. |
| 2026-05-18 | brandon-behring/eval-toolkit | #41 | _parallel.parallel_map joblib worker copies the entire input spec (including DataFrames passed in the tuple): OOM-risk on many-core machines when each cell carries a ~30-100 MB predictions DataFrame × N workers | src/eval/marginal_bootstrap.py::_compute_one_cell_or_none worker + compute_marginal_battery default n_jobs=1 (opt-in parallelism) | RESOLVED in eval-toolkit v0.39.0 (closed 2026-05-18 20:13 UTC); upstream documented the worker-copy memory model in methodology/parallelism.md. Our defensive n_jobs=1 default at src/eval/marginal_bootstrap.py::compute_marginal_battery remains correct + now well-documented upstream. No code change required at v1.0.6 (the upstream resolution was docs-only; the worker-copy semantics themselves are joblib-loky design; shared-state pattern remains opt-in for callers). |
| 2026-05-18 | brandon-behring/eval-toolkit | #43 | Add fit_platt_binary + fit_beta_binary calibrators (binary-class scalar-prob adapters; siblings of fit_temperature_binary shipped in v0.35.0) | src/eval/calibration_battery.py: Platt + Beta deferred per ADR-023 original scope; v1.0.8 wants to consume upstream rather than hand-roll | RESOLVED in eval-toolkit v0.40.0 (closed 2026-05-19 00:17 UTC; ~17 min after filing, the fastest upstream turnaround of the v1.0.x series). Consumed at v1.0.8 per ADR-056: src/eval/calibration_battery.py refactored to use fit_platt_binary + fit_beta_binary (canonical (params, apply) shape) + fit_temperature_binary (the multi-class fit_temperature API we had been using was retired) + local fit_isotonic_binary_local adapter pending #44. ADR-023 frontmatter superseded_by: [056] in-place. |
| 2026-05-19 | brandon-behring/eval-toolkit | #44 | Add fit_isotonic_binary for shape consistency with fit_temperature_binary + fit_platt_binary + fit_beta_binary (returns (None, apply) since isotonic is non-parametric) | src/eval/calibration_battery.py::fit_isotonic_binary_local: thin adapter wrapping fit_isotonic_calibrator to return (None, apply) shape | RESOLVED in eval-toolkit v0.42.0 (closed 2026-05-19 01:20 UTC; release 01:25 UTC, 5 min between closure + PyPI publish). Consumed at v1.0.9: src/eval/calibration_battery.py refactored to use upstream fit_isotonic_binary directly; fit_isotonic_binary_local adapter + orphaned Callable import deleted in same commit per no-orphaned-code invariant. eval-toolkit pin bumped v0.40.0 → v0.42.0 (v0.41.0 skipped because it predated the #44 close). 4-of-4 binary calibrator family now lands on canonical upstream _binary API. |
| 2026-05-19 | brandon-behring/eval-toolkit | #48 | Add ood_dataset_from_manifest helper for manifest-driven OOD dataset assembly | pyproject.toml + decisions/library_imports.md dependency pin only in v1.2.8; no current local call site | RESOLVED in eval-toolkit v0.43.0; recorded/consumed as dependency maintenance at v1.2.8. No project methodology, model, data, or artifact output changes; future OOD dataset assembly can use the upstream helper instead of local glue if that work is opened. |
| 2026-05-19 | brandon-behring/eval-toolkit | #49 | Add character_injection core-6 prompt-injection pattern support (+ advanced-6 extension shipped in v0.47.0) | pyproject.toml + decisions/library_imports.md dependency pin only in v1.2.8 + v1.2.16; no current local call site | RESOLVED in eval-toolkit v0.43.0 (core-6) + v0.47.0 (advanced-6 + sweep unification); recorded/consumed as dependency maintenance at v1.2.8 + v1.2.16. No prompt family is added to this project’s evaluation slate; the reviewer-facing v1.0.0 pin and live-site results remain unchanged. |
| 2026-05-19 | brandon-behring/eval-toolkit | #53 | Add ActivationDeltaProbe for activation-delta analysis | pyproject.toml + decisions/library_imports.md dependency pin only in v1.2.8; no current local call site | RESOLVED in eval-toolkit v0.43.0; recorded/consumed as dependency maintenance at v1.2.8. Activation-delta demos/strategy remain separate from this patch; no analysis claim is added here. |
| 2026-05-19 | brandon-behring/eval-toolkit | #36 | Add inline bootstrap CI on every metric (Inspect-AI / lm-eval scorecard pattern; with_ci=True kwarg on pr_auc/roc_auc/brier_score/ece variants) | src/eval/marginal_bootstrap.py:31-32,42-43,110 + src/eval/calibration_battery.py:47-52; current explicit bootstrap_ci(metric_fn, ...) orchestrator would simplify with inline-CI sugar | RESOLVED in eval-toolkit v0.46.0 (different API shape than originally proposed); recorded as dependency maintenance at v1.2.16. Upstream chose scorecard(metric_specs, ..., bootstrap=True) + Scorecard (read-only Mapping[str, MetricResult]) + metric_specs (type-safe metric spec objects) over the original with_ci=True per-function kwarg; this addresses the same use case (one-call multi-metric bootstrap CIs) with type-safe contracts. Our project’s contribution comment at #36 issue-comment was the request. Consumer-side migration deferred to a future v1.3.x cycle if the bootstrap_ci(metric_fn, ...) orchestrator pattern is replaced with scorecard(...). |
| 2026-05-22 | brandon-behring/eval-toolkit | #71 | audit.reader_value_bindings: validate detector→value bindings in reader-prose Markdown (catches the v1.3.2 P1-1 Act 0 bug class where canonical 0.974 was bound to TF-IDF instead of LoRA) | scripts/audit_value_bindings.py (v1.3.8) wraps eval_toolkit.audit_value_bindings.validate_reader_value_bindings with the consumer’s BINDINGS dict + DETECTOR_ALIASES + METRIC_ALIASES regex maps. | RESOLVED in eval-toolkit v1.0.3; consumed at v1.3.8. Filed at v1.3.3; PR merged into upstream main; v1.0.3 release shipped 2026-05-26 02:35Z as second member of the audit-validator family (after #73 / audit_citation_alignment at v1.0.1). |
| 2026-05-22 | brandon-behring/eval-toolkit | #72 | audit.sister_doc_concept_drift: detect cross-doc semantic drift on shared concept tokens (catches the v1.3.2 P1-2 T1-tier-meaning bug class where one doc said T1 = full cloud rerun and the linked sister said T1 = laptop smoke) | No existing script in this project; gap surfaced by Codex external audit + Claude proactive Phase-2 sweep. | filed at v1.3.3; triaged 2026-05-24 (P3, enhancement, tracked, #3 of 3 in audit/ subpackage drop per maintainer sequencing); awaiting upstream PR after #74 + #71 land |
| 2026-05-22 | brandon-behring/eval-toolkit | #73 | audit.adr_citation_alignment: validate that “per ADR-NNN” citations match the cited ADR’s actual subject (catches the v1.3.2 P1-2 ADR-029-cited-for-tier-lock + ADR-039-cited-for-cost bug class) | scripts/audit_citation_alignment.py (v1.3.7) wraps eval_toolkit.audit_citation_alignment.validate_citations with the consumer’s CATEGORY_KEYWORDS map + builds ADRSubject records from decisions/ADR-*.md frontmatter. | RESOLVED in eval-toolkit v1.0.1; consumed at v1.3.7. Filed at v1.3.3; PR #74 merged 2026-05-25T22:51:15Z (restructured pre-merge to flat module per upstream ADR 0001: src/eval_toolkit/audit_citation_alignment.py not subpackage); shipped in v1.0.1 release ~22:55Z; v1.0.2 release ~23:30Z added #76 cleanup batch (same module blob). |
| 2026-05-26 | brandon-behring/eval-toolkit | #82 | audit_citation_alignment: multi-ADR-citation context-awareness (R15 follow-on; Layer 2 scope + Layer 3 pairing extension to the citation-alignment validator). 188 warnings on consumer repo at tree/v1.3.13 HEAD with eval-toolkit==1.3.0. Three patterns: (α) dense multi-ADR-citation list within one paragraph; (β) spec/table per-row ADR citations; (γ) multi-claim sentence with ADR per-clause. Same architectural class as v1.3.0 Layer 3 pairing rules just resolved for audit_value_bindings via #81. | scripts/audit_citation_alignment.py SKIP_PATTERNS mirrored at v1.3.13 (added SUBMISSION.md, _codex.md, AUDIT_CLAUDE_, draft.md, draft_review.md for parallel discipline with audit_value_bindings). No further consumer-side workaround applied. | filed at v1.3.13 (2026-05-26); sixth library-first cycle (R15-equivalent following R11→R14 closure for audit_value_bindings). Proposed approach: Path A (scope='narrative' mirror of v1.1.0) + Path B (pairing rules mirror of v1.3.0). Acceptance criterion: 188 → ≤20 warnings on consumer HEAD. Consumer-side BUNDLED HARD-gate promotion (audit_value_bindings + audit_citation_alignment together per v1.3.8 plan) deferred to v1.3.14+ pending upstream resolution. Awaiting upstream triage. |
| 2026-05-26 | brandon-behring/eval-toolkit | #81 | audit_value_bindings: cross-detector list-grammar follow-on (v1.3.0+ deferred work per upstream ADR 0005 + Round 13 ledger). Three patterns: (1) “X for the frozen probe and Y for TF-IDF” for X list connective; (2) “A’s metric is X against B’s Y” possessive comparative; (3) “in-pool X → cross-family Y for the trained detectors” sentence-shift contrast. After v1.2.0 T1-T4 context filters resolved 32 of 36 residual FPs, the remaining 4 are pure cross-detector list-grammar that the v1.2.0 positional heuristic can’t disambiguate. | scripts/audit_value_bindings.py migrated at v1.3.11 from 2-tuple to BindingKey schema + SLICE_ALIASES + scope='narrative'; pin bumped v1.1.0→v1.2.0 at v1.3.12 (T1-T4 context filters); pin bumped v1.2.0→v1.3.0 at v1.3.13 (Layer 3 pairing rules; no script changes; Tier-1 ADDITIVE). Filed against README.md:71, RESULTS.md:171 (×2), WRITEUP_PAPER.md:304. | RESOLVED in eval-toolkit v1.3.0 (2026-05-26T22:27Z; ~51min compressed cycle); consumed at v1.3.13. Upstream shipped Layer 3 pairing rules (4 patterns A/B/C/D per ADR 0006) closing all 4 consumer residuals + emerging Pattern D during upstream dogfood. Consumer dogfood at v1.3.13: 4→0 warnings (100% reduction); closes R11→R14 cycle on audit_value_bindings. Upstream Round 14 ledger explicitly invites HARD-gate promotion; consumer side defers to v1.3.14+ pending #82 (audit_citation_alignment context-awareness) for the bundled-HARD plan per v1.3.8 CHANGELOG. |
| 2026-05-26 | brandon-behring/eval-toolkit | #80 | audit_value_bindings: extend BINDINGS schema with slice axis (3-tuple (detector, metric, slice) keys) so same-detector + same-metric across different slices don’t cross-flag. Architectural follow-on to #71: the current 2-tuple schema cannot distinguish direct-validation AUPRC from pooled-OOD AUPRC, point estimates from paired deltas, or detector values from random-floor values; this produces ~95% false-positive rate on the consumer repo. F1 (TF-IDF/LoRA value-binding swap in WRITEUP_PAPER.md:545) was correctly flagged at v1.3.8 but hid in 96 warnings of noise. v1.3.9 ships the F1 fix; the 95+ other false-positive warnings persist until upstream lands. | scripts/audit_value_bindings.py migrated at v1.3.11 from 2-tuple (detector, metric) schema to upstream-recommended BindingKey structured-key schema per ADR 0005; BINDINGS expanded from 2 entries to 15 covering 5 detectors × 2 metrics × 4 slices (direct_validation, pooled_ood, jbb, xstest); added SLICE_ALIASES mapping + scope='narrative' content-type filter. v1.3.12 follow-on: pin bump v1.1.0→v1.2.0 (Tier-1 ADDITIVE; T1-T4 context-aware narrative filters activate automatically via existing scope='narrative'). | RESOLVED in eval-toolkit v1.1.0 (2026-05-26T20:11Z; ~2h compressed cycle); consumed at v1.3.11. Upstream shipped MORE than proposed: BindingKey (frozen dataclass; forward-extensible for future axes per ADR 0005) instead of bare 3-tuple, plus scope='narrative' content-type filter (excludes tables + bracketed expressions + code blocks), the second architectural layer beyond what was filed. Consumer dogfood at v1.3.11: 96→36 warnings (62% reduction on this repo; upstream dogfood reported 76% against v1.3.9 baseline before v1.3.10 additions). Remaining 36 were upstream-acknowledged positional-heuristic limitations (multi-detector list pairings, sub-clause sentence boundaries, “vs” comparisons). v1.2.0 follow-on shipped 2026-05-26T21:20Z (~1h after v1.1.0 closure) addressing 32 of 36 residuals via T1-T4 context-aware narrative filters; consumed at v1.3.12: 36→4 warnings (96% total reduction from 96-baseline). Remaining 4 residuals are pure cross-detector list-grammar (v1.3.0+ upstream parser-level work; see separate filed issue). HARD-gate promotion still deferred per v1.3.8 bundled-promotion plan; awaiting cross-detector list-grammar resolution OR observation window confirmation. |
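
The warning counts quoted in the #80 and #81 rows can be checked with a few lines of arithmetic. The sketch below only re-derives the percentages from the counts stated in those rows (96, 36, 4, 0); the helper name `reduction_pct` is an illustrative assumption, not part of any audit script.

```python
def reduction_pct(before: int, after: int) -> float:
    """Percentage reduction in warning count from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 100.0 * (before - after) / before


# Warning counts as reported in the ledger rows for #80 and #81.
baseline, after_v1_1_0, after_v1_2_0, after_v1_3_0 = 96, 36, 4, 0

step_1 = reduction_pct(baseline, after_v1_1_0)      # consumed at v1.3.11
cumulative = reduction_pct(baseline, after_v1_2_0)  # consumed at v1.3.12
step_3 = reduction_pct(after_v1_2_0, after_v1_3_0)  # consumed at v1.3.13

print(f"{step_1:.1f}% {cumulative:.1f}% {step_3:.1f}%")
```

The results are 62.5%, about 95.8%, and 100%, in line with the 62%, 96%, and 100% figures quoted in the rows. Note that the 96% figure is cumulative against the 96-warning baseline, not a per-step reduction.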

How to use this ledger

When you discover a gap during Phase 1+ work:

  1. File the upstream issue with the tracked label (use gh issue create)
  2. Add a row to this ledger with the issue URL + the local file:line that depends on it
  3. If a workaround is unavoidable, leave a TODO comment in code citing the issue number; remove when upstream lands
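
Step 1 can be scripted. The sketch below builds a `gh issue create` command using the CLI's `--repo`, `--title`, `--body`, and `--label` flags and prints it for review rather than running it. The helper name `gh_issue_create_argv` and the placeholder title and body are assumptions for illustration; they are not part of this project's tooling.

```python
import shlex


def gh_issue_create_argv(
    repo: str, title: str, body: str, label: str = "tracked"
) -> list[str]:
    """Build the argv for `gh issue create` against an upstream repo."""
    return [
        "gh", "issue", "create",
        "--repo", repo,
        "--title", title,
        "--body", body,
        "--label", label,
    ]


argv = gh_issue_create_argv(
    repo="brandon-behring/eval-toolkit",
    title="Example: add a missing primitive",  # placeholder title
    body="Local site: path/to/file.py:LINE (placeholder)",
)
# Print the command for review instead of executing it; pass `argv` to
# subprocess.run(argv, check=True) once the text is final.
print(shlex.join(argv))
```

Putting the local file:line in the issue body keeps the upstream issue and the ledger row pointing at the same site.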

A workaround that hand-rolls a primitive without first filing the issue is an anti-pattern (see SPEC_GREENFIELD.md §7 Anti-patterns).

Test-coverage-gap entries (per ADR-028)

When a coverage gap surfaces under the 70% floor that would be better addressed by an upstream library test (e.g., a test pattern that should live in eval-toolkit’s harness coverage, or a runpod-deploy preflight scenario) rather than absorbed as a low-value local test:

  1. File the upstream issue with the proposed test pattern (sketch, not implementation), the rationale (why upstream is the right home), the local file:line that depends on the absent test, and the tracked label.
  2. Add a row to this ledger with the [test-coverage-gap] prefix in the Title column + the issue URL + the local file:line.

When a gap genuinely cannot be filed upstream (project-specific glue) AND cannot be cheaply tested locally:

  1. Leave a code comment with the rationale (e.g., # noqa: COV — Phase 0-06 deferral; project-specific orchestration glue, see decisions/upstream_issues.md).
  2. Add a row to this ledger with the [not-applicable] prefix in the Title column + the local file:line + the deferral rationale in the Status column.

Both forms preserve the discipline trail without forcing local anti-tests under the 70% floor.

A coverage gap that gets papered over with a no-op test (or with # pragma: no cover and no ledger entry) is an anti-pattern.
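
One way to catch the second half of that anti-pattern is a check that flags `# pragma: no cover` sites in files the ledger never mentions. The following is a minimal sketch under stated assumptions: the function name `unledgered_exclusions`, the substring match on file paths, and the sample inputs are all hypothetical, and no such script is described in this repo.

```python
import re

# Matches the coverage.py exclusion comment, tolerating extra whitespace.
PRAGMA = re.compile(r"#\s*pragma:\s*no cover")


def unledgered_exclusions(sources: dict[str, str], ledger_text: str) -> list[str]:
    """Return 'path:line' sites that use `# pragma: no cover` in files whose
    path never appears in the ledger text."""
    missing = []
    for path, text in sorted(sources.items()):
        if path in ledger_text:
            continue  # the ledger mentions this file; assume a row covers it
        for lineno, line in enumerate(text.splitlines(), start=1):
            if PRAGMA.search(line):
                missing.append(f"{path}:{lineno}")
    return missing


# Hypothetical inputs for illustration only.
sources = {
    "src/glue/orchestrate.py": "def run():\n    pass  # pragma: no cover\n",
    "src/glue/report.py": "def emit():  # pragma: no cover\n    pass\n",
}
ledger_text = "[not-applicable] orchestration glue | src/glue/orchestrate.py:2 | deferred"
print(unledgered_exclusions(sources, ledger_text))
```

The path-level match is deliberately coarse: it accepts any ledger mention of the file as coverage for every exclusion in it. A stricter variant would compare file:line pairs.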