Upstream issues filed during this iteration

How to read this page. This is the evidence ledger behind the anti-hand-rolling rule. It is dense by design: each row ties a discovered upstream gap to the project code or documentation that depended on it.

The ledger ties the anti-hand-rolling discipline to its evidence: every discovered library gap, bug, or feature request is filed to the relevant upstream GitHub repo before any local workaround is written. The Local site column records where the dependency or workaround lives in this repo.

Filed issues go to:

brandon-behring/eval-toolkit — evaluation primitives
brandon-behring/runpod-deploy — cloud orchestration
brandon-behring/research_toolkit — dossier production

Triage label on filed issues: tracked.

Current interpretation

Many early eval-toolkit gaps are now resolved and consumed; stale local workaround annotations should not be read as active debt unless the status column still says filed or contribution pending.
The v1.2.8 dependency bump records eval-toolkit #48, #49, and #53 as library-first maintenance only. It does not add a new methodology claim or change the reported results.
Demo-strategy ideas such as activation-delta analysis stay outside this patch unless a future ADR scopes them into the project.
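
The status rule in the first point above can be checked mechanically. The following is a minimal sketch, assuming the ledger table is available as tab-separated text with its six columns (Date, Repo, Issue #, Title, Local site, Status). The `parse_ledger` and `is_active_debt` helpers, the prefix matching, and the abbreviated sample rows are illustrative assumptions, not project tooling.

```python
import csv
import io

COLUMNS = ["date", "repo", "issue", "title", "local_site", "status"]


def parse_ledger(tsv_text: str) -> list[dict[str, str]]:
    """Parse tab-separated ledger rows (header line included) into dicts."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = [r for r in reader if r]
    return [dict(zip(COLUMNS, r)) for r in rows[1:]]  # skip the header row


def is_active_debt(row: dict[str, str]) -> bool:
    """Active only if the status still says filed or contribution pending."""
    status = row["status"].strip().lower()
    return status.startswith("filed") or "contribution pending" in status


# Two abbreviated rows modelled on the ledger (titles and statuses shortened).
sample = (
    "Date\tRepo\tIssue #\tTitle\tLocal site (file:line)\tStatus\n"
    "2026-05-16\tbrandon-behring/eval-toolkit\t#14\tAdd plot_roc_curve\t"
    "src/eval/figures.py\tRESOLVED in eval-toolkit v0.42.0\n"
    "2026-05-16\tbrandon-behring/eval-toolkit\t#18\tWire dedup-holdout golden test\t"
    "data/dedup_holdout.jsonl\tfiled at Phase 1 library-first audit\n"
)

active = [row["issue"] for row in parse_ledger(sample) if is_active_debt(row)]
print(active)  # ['#18']
```

Matching on the start of the Status cell keeps rows such as "RESOLVED ... Filed at v1.3.3" out of the active set, since those mention the filing only as history.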

Ledger scope (clarified at v1.2.13; updated at v1.2.14)

This ledger tracks issues filed as a consequence of this project’s library-first audits (Phase 1 carryforward refactor, Phase 4 figures, v1.0.x post-submission patches, v1.1.x DeBERTa ablation, v1.2.x library maintenance). Upstream issues filed independently of this project’s audit path are recorded here when a dependency bump references them (dependency/ledger maintenance per ADR-066 trigger #4), or noted as out-of-scope when no consumption is planned.

eval-toolkit #50 (feat(losses): RecallAtLowFPR — Meta Prompt Guard 2 recipe; CLOSED 2026-05-19; shipped in eval-toolkit v0.44.0; recorded/consumed as dependency maintenance at v1.2.14): loss-recipe addition. No methodology dependency for this project’s evaluation slate at v1.2.14; the pin bump records the upstream resolution without changing this project’s reported metrics.
eval-toolkit #51 (feat(preprocessing): spotlighting; CLOSED 2026-05-21; shipped in eval-toolkit v0.44.0; recorded/consumed as dependency maintenance at v1.2.14): defense-side preprocessing pattern (delimit/datamark/encode + sweep). Out of v1.0.0/v1.2.x evaluation slate scope per ADR-039 + the portfolio-repo plan ([[portfolio_plan_approved]] memory); recorded here because the v0.44.0 bump references its closure.
eval-toolkit #52 (feat(stacking): MetaLearner Protocol + LogisticStacker; CLOSED 2026-05-21; shipped in eval-toolkit v0.45.0; recorded/consumed as dependency maintenance at v1.2.16): stacking primitive; out of v1.0.0/v1.2.x evaluation slate scope per ADR-007 (single-rung detectors only at submission tier); recorded here because the v0.45.0 leg of the v0.44→v0.47 bump at v1.2.16 references its closure.

If any of these later become consumption candidates (e.g., a v1.3.x methodology decision opens preprocessing or stacking), a new ledger row will be added at consumption time with the full local file:line + status.

Date	Repo	Issue #	Title	Local site (file:line)	Status
2026-05-15	`brandon-behring/eval-toolkit`	#17	Add optional `n_jobs` parameter to `paired_bootstrap_diff` for internal resample-loop parallelization	`scripts/run_bootstrap_battery.py` (Phase 4 deliverable per ADR-022)	RESOLVED in eval-toolkit v0.42.0; ledger status updated at v1.2.2 per ADR-066 — `et.paired_bootstrap_diff(..., n_jobs=N)` available. Project keeps `n_jobs=1` default per #41 worker-copy memory model (shared-state pattern is opt-in for callers).
2026-05-16	`brandon-behring/eval-toolkit`	#14	Add `plot_roc_curve` (sibling to `plot_pr_curve`)	`src/eval/figures.py` Phase 4 F2 deliverable per ADR-046	RESOLVED in eval-toolkit v0.42.0; consumed at v1.0.x per ADR-066 §B2 — `et.plot_roc_curve` dispatched in `src/eval/figures.py::render_f2_roc_per_rung`; stale annotation removed at v1.2.2 Commits 3-6 consolidated.
2026-05-16	`brandon-behring/eval-toolkit`	#15	Add `plot_pareto_frontier` for cost-vs-performance scatter with frontier overlay	`src/eval/figures.py` Phase 4 F1 deliverable per ADR-046	RESOLVED in eval-toolkit v0.42.0; consumed at v1.0.x per ADR-066 §B1 — `et.plot_pareto_frontier` dispatched in `src/eval/figures.py::render_f1_pareto`; stale annotation removed at v1.2.2 Commits 3-6 consolidated.
2026-05-16	`brandon-behring/eval-toolkit`	#16	Add `plot_slice_metric_heatmap` for `(group_x × group_y × metric)` grids	`src/eval/figures.py` Phase 4 F5 deliverable per ADR-046	RESOLVED in eval-toolkit v0.42.0; consumed at v1.0.x per ADR-066 §B3 — `et.plot_slice_metric_heatmap` dispatched in `src/eval/figures.py::render_f5_slice_heatmap`; stale annotation removed at v1.2.2 Commits 3-6 consolidated.
2026-05-16	`brandon-behring/eval-toolkit`	#18	Wire dedup-holdout golden test against this project’s 50-pair adversarial dataset	`data/dedup_holdout.jsonl` + `scripts/calibrate_dedup.py` + `evals/dedup_calibration.json`	filed at Phase 1 library-first audit (per ADR-047); golden-test contribution candidate
2026-05-16	`brandon-behring/eval-toolkit`	#19	Add cookbook docs: 3 compositional patterns (nested-seed splits + callable-embedder strategy + pairs_across contamination scan)	`src/data/{splits, dedup, audit}.py` (Phase 1 hand-rolls that motivated this)	filed at Phase 1 library-first audit (per ADR-047); docs PR candidate
2026-05-16	`brandon-behring/eval-toolkit`	#20	Generalize `mde_from_ci` to accept `BootstrapCI | PairedBootstrapCI` for marginal-MDE use case	`scripts/run_mde.py` Phase 4 deliverable per ADR-046 Q4	RESOLVED in eval-toolkit v0.42.0; consumed at v1.0.x per ADR-066 §B5 — `et.mde_from_ci` generalized; `src/eval/mde.py::mde_from_marginal_ci_record` uses the upstream API directly. Stale `schemas.py:336` annotation removed at v1.2.2 Commits 3-6 consolidated.
2026-05-16	`brandon-behring/eval-toolkit`	#21	Add `block_bootstrap_on_folds` (CV-aware block bootstrap; complement to `cv_clt_ci`)	`src/eval/cross_fold_ci.py` Phase 4 deliverable per ADR-046 Q3 + ADR-024 + A-008	RESOLVED in eval-toolkit v0.42.0; consumed at v1.0.x per ADR-066 §B6 — `et.block_bootstrap_on_folds` shipped; `src/eval/cross_fold_ci.py::compute_block_bootstrap_on_folds` thin-wraps. Stale `cross_fold_ci.py:202` annotation removed at v1.2.2 Commits 3-6 consolidated.
2026-05-17	`brandon-behring/eval-toolkit`	#22	Add `ax: Axes | None` kwarg to `plot_metric_bars` (parity with `plot_pr_curve` / `plot_reliability_diagram` / `plot_lift_ci` / `plot_bootstrap_distribution`)	`src/eval/figures.py::render_f6_lodo_breakdown` Phase 4 Commit 4 per ADR-046 Q6	RESOLVED in eval-toolkit v0.42.0; consumed at v1.2.2 per ADR-066 §B4 — `et.plot_metric_bars(ax=...)` shipped; bare-matplotlib bars in `src/eval/figures.py::render_f6_lodo_breakdown` left panel replaced with library-first dispatch in v1.2.2 Commits 3-6 consolidated (the only ACTUAL refactor site in the v1.2.2 series — all others were already-consuming + stale-comment cleanup).
2026-05-17	`brandon-behring/runpod-deploy`	#88	Make SSH-ready timeout configurable (default 240s too aggressive for cold image pulls)	`scripts/runpod_deploy_long_ssh.py` (monkey-patch shim with 600s deadline); Makefile `headline-{frozen-probe,lora,full-ft}` use shim for `run --config`	RESOLVED in runpod-deploy v0.8.2 (configurable `budget.ssh_ready_timeout_sec`; default 900s). Consumed at v1.1.0 per ADR-059: shim DELETED in same commit as pin bump per no-orphaned-code invariant; `budget.ssh_ready_timeout_sec: 600` explicit override added to all 3 headline-*.yaml configs (matches the shim’s 600s value; preserves effective behavior).
2026-05-17	`brandon-behring/runpod-deploy`	#92 (doc) + #93 (PR) + #94 (scan-consumer)	UV_LINK_MODE=copy + UV_CACHE_DIR + UV_PROJECT_ENVIRONMENT off /workspace required on FUSE-mounted /workspace (silent hang under default hardlink mode + git-resolution F_SETLKW hang + HF Trainer atomic-save F_SETLKW hang)	`configs/runpod/headline-{frozen_probe,lora,full_ft}.yaml` `remote_env.exports` (Phase 4 X7 + X8 commits `13f06b8`/`8af73ae`); memory entry `memory/fuse-workspace-needs-uv-link-mode-copy.md` (3 FUSE failure modes); seed manifest + cost ledger (`artifacts/runpod/manual-20260517T155912Z/runpod_deploy_pull_manifest.json` + `evals/cost_ledger.csv` commit `7c1f36c`)	RESOLVED — PR #93 merged + #92/#94 docs/scan-consumer landed in runpod-deploy v0.8.x. v1.1.0 consumes upstream resolutions via the pin bump (ADR-059); existing `remote_env.exports` UV_LINK_MODE=copy + UV_CACHE_DIR + UV_PROJECT_ENVIRONMENT entries retained as belt-and-suspenders (no upstream code requires removal). External verification: uv#17801, MooseFS#380, Linux kernel patch 2025-12-23.
2026-05-17	`brandon-behring/runpod-deploy`	#90 (feat)	`lifecycle.on_success: recycle` — resume paused pods on `run` retry (avoid re-provisioning when only config edits changed)	bypassed orchestrator via manual SSH for `i5p43hfq88zpqt` recovery (Phase 4 re-fire #5); no local code dependency (waiting on upstream feature)	RESOLVED — `lifecycle.on_success: recycle` semantics landed in runpod-deploy v0.8.x. Consumed at v1.1.0 Commit 3 of 3 for the DeBERTa-v3-base medium ablation via new `configs/runpod/headline-deberta.yaml` (per /exploring-options 2026-05-19 Q2 lock — sequential single pod, 2 training fires sharing 1 pod via recycle; saves ~$1-2 + ~3-5 min vs full teardown between truncation strategies). The 3 existing headline configs use `on_success: delete` per ADR-013 pre-teardown checklist; recycle is opt-in.
2026-05-17	`brandon-behring/runpod-deploy`	#97 (feat)	`validate --all` should HTTP HEAD-check `pod.image` against the public registry (catch phantom tags before pod fires)	no local code dependency (validate-time check; consumer-side fix was image-refresh in `configs/runpod/headline-*.yaml` Phase 4 X5 commit `b629181`)	RESOLVED — image-registry HEAD-check landed in v0.8.x; invoked by default within `validate --all`. Consumed automatically at v1.1.0 (no project-side wiring needed); `make headline-dry-run` now reports `[image-registry] ok: 'runpod/pytorch:...' exists on Docker Hub` line for each of the 3 migrated configs. Catches phantom Docker Hub tags before billing.
2026-05-17	`brandon-behring/runpod-deploy`	#98 (docs)	Makefile-recipe pattern doc when `runpod-deploy` is in `[project.optional-dependencies] dev` (consumers hit cryptic PATH error otherwise)	`Makefile` recipes use `uv run runpod-deploy ...` (Phase 4 X4 sed-rewrite commit `a04e016`)	RESOLVED — Makefile-recipe pattern documented in runpod-deploy v0.8.x docs. Consumed at v1.1.0 via continued use of the `uv run runpod-deploy ...` pattern in the Makefile (now also covers the 3 callers previously delegating to the deleted `scripts/runpod_deploy_long_ssh.py` shim — they now use `uv run runpod-deploy run ...` directly per #88 closure).
2026-05-18	`brandon-behring/eval-toolkit`	#39	Add `is_metric_defined_for_slice(slice_name, metric_name) -> bool` primitive (or equivalent `eval_toolkit.metrics.skip_single_class_slices` filter) so consumers don’t hand-roll the “AUROC/AUPRC undefined on single-class slices” rule	`src/eval/slice_analysis.py::is_metric_defined_for_slice` (v1.0.0 Item 4 single-class filter) + `src/eval/marginal_bootstrap.py::compute_marginal_battery` source-level filter + `scripts/run_cv_clt_ci.py` orchestrator filter	RESOLVED in eval-toolkit v0.39.0 (closed 2026-05-18 20:13 UTC). Consumed at v1.0.6: `src/eval/slice_analysis.py::is_metric_defined_for_slice` now thin-wraps upstream `eval_toolkit.is_metric_defined_for_slice(metric_name, is_single_class=slice_name in SINGLE_CLASS_SLICES)`; local `SINGLE_CLASS_INCOMPATIBLE_METRICS` constant removed (uses upstream `eval_toolkit.SINGLE_CLASS_INCOMPATIBLE_METRICS`). Project-specific knowledge of which slice NAMES are single-class stays local (`SINGLE_CLASS_SLICES` frozen-set kept).
2026-05-18	`brandon-behring/eval-toolkit`	#40	Relax `LeakageCheck` Protocol `name` member from settable-attr to read-only `@property` (or `Final[str]`) so `@dataclass(frozen=True)` LeakageCheck implementations like `CrossSplitLeakageCheck` are mypy-strict-compatible	`src/data/audit.py:222` — `cast(LeakageCheck, leakage_check)` workaround at the call site	RESOLVED in eval-toolkit v0.39.0 (closed 2026-05-18 20:13 UTC). Consumed at v1.0.6: `cast(LeakageCheck, leakage_check)` workaround removed at `src/data/audit.py:222`; unused `LeakageCheck` import removed from line 34. Confirmed `LeakageCheck.name` is now an `@property` via runtime `inspect` check.
2026-05-18	`brandon-behring/eval-toolkit`	#41	`_parallel.parallel_map` joblib worker copies the entire input spec (including DataFrames passed in the tuple) — OOM-risk on many-core machines when each cell carries a ~30-100 MB predictions DataFrame × N workers	`src/eval/marginal_bootstrap.py::_compute_one_cell_or_none` worker + `compute_marginal_battery` default `n_jobs=1` (opt-in parallelism)	RESOLVED in eval-toolkit v0.39.0 (closed 2026-05-18 20:13 UTC) — upstream documented the worker-copy memory model in `methodology/parallelism.md`. Our defensive `n_jobs=1` default at `src/eval/marginal_bootstrap.py::compute_marginal_battery` remains correct + now well-documented upstream. No code change required at v1.0.6 (the upstream resolution was docs-only; the worker-copy semantics themselves are joblib-loky design — shared-state pattern remains opt-in for callers).
2026-05-18	`brandon-behring/eval-toolkit`	#43	Add `fit_platt_binary` + `fit_beta_binary` calibrators (binary-class scalar-prob adapters; siblings of `fit_temperature_binary` shipped in v0.35.0)	`src/eval/calibration_battery.py` — Platt + Beta deferred per ADR-023 original scope; v1.0.8 wants to consume upstream rather than hand-roll	RESOLVED in eval-toolkit v0.40.0 (closed 2026-05-19 00:17 UTC; ~17 min after filing — fastest upstream turnaround of the v1.0.x series). Consumed at v1.0.8 per ADR-056: `src/eval/calibration_battery.py` refactored to use `fit_platt_binary` + `fit_beta_binary` (canonical `(params, apply)` shape) + `fit_temperature_binary` (the multi-class `fit_temperature` API we had been using was retired) + local `fit_isotonic_binary_local` adapter pending #44. ADR-023 frontmatter `superseded_by: [056]` in-place.
2026-05-19	`brandon-behring/eval-toolkit`	#44	Add `fit_isotonic_binary` for shape consistency with `fit_temperature_binary` + `fit_platt_binary` + `fit_beta_binary` (returns `(None, apply)` since isotonic is non-parametric)	`src/eval/calibration_battery.py::fit_isotonic_binary_local` — thin adapter wrapping `fit_isotonic_calibrator` to return `(None, apply)` shape	RESOLVED in eval-toolkit v0.42.0 (closed 2026-05-19 01:20 UTC; release 01:25 UTC — 5 min between closure + PyPI publish). Consumed at v1.0.9: `src/eval/calibration_battery.py` refactored to use upstream `fit_isotonic_binary` directly; `fit_isotonic_binary_local` adapter + orphaned `Callable` import deleted in same commit per no-orphaned-code invariant. eval-toolkit pin bumped v0.40.0 → v0.42.0 (v0.41.0 skipped because it predated the #44 close). 4-of-4 binary calibrator family now lands on canonical upstream `_binary` API.
2026-05-19	`brandon-behring/eval-toolkit`	#48	Add `ood_dataset_from_manifest` helper for manifest-driven OOD dataset assembly	`pyproject.toml` + `decisions/library_imports.md` dependency pin only in v1.2.8; no current local call site	RESOLVED in eval-toolkit v0.43.0; recorded/consumed as dependency maintenance at v1.2.8. No project methodology, model, data, or artifact output changes; future OOD dataset assembly can use the upstream helper instead of local glue if that work is opened.
2026-05-19	`brandon-behring/eval-toolkit`	#49	Add `character_injection` core-6 prompt-injection pattern support (+ advanced-6 extension shipped in v0.47.0)	`pyproject.toml` + `decisions/library_imports.md` dependency pin only in v1.2.8 + v1.2.16; no current local call site	RESOLVED in eval-toolkit v0.43.0 (core-6) + v0.47.0 (advanced-6 + sweep unification); recorded/consumed as dependency maintenance at v1.2.8 + v1.2.16. No prompt family is added to this project’s evaluation slate; the reviewer-facing v1.0.0 pin and live-site results remain unchanged.
2026-05-19	`brandon-behring/eval-toolkit`	#53	Add `ActivationDeltaProbe` for activation-delta analysis	`pyproject.toml` + `decisions/library_imports.md` dependency pin only in v1.2.8; no current local call site	RESOLVED in eval-toolkit v0.43.0; recorded/consumed as dependency maintenance at v1.2.8. Activation-delta demos/strategy remain separate from this patch; no analysis claim is added here.
2026-05-19	`brandon-behring/eval-toolkit`	#36	Add inline bootstrap CI on every metric (Inspect-AI / lm-eval scorecard pattern; `with_ci=True` kwarg on `pr_auc`/`roc_auc`/`brier_score`/`ece` variants)	`src/eval/marginal_bootstrap.py:31-32,42-43,110` + `src/eval/calibration_battery.py:47-52` — current explicit `bootstrap_ci(metric_fn, ...)` orchestrator would simplify with inline-CI sugar	RESOLVED in eval-toolkit v0.46.0 (different API shape than originally proposed); recorded as dependency maintenance at v1.2.16. Upstream chose `scorecard(metric_specs, ..., bootstrap=True)` + `Scorecard` (read-only `Mapping[str, MetricResult]`) + `metric_specs` (type-safe metric spec objects) over the original `with_ci=True` per-function kwarg; this addresses the same use case (one-call multi-metric bootstrap CIs) with type-safe contracts. The original request was this project’s contribution comment on issue #36. Consumer-side migration is deferred to a future v1.3.x cycle, if the `bootstrap_ci(metric_fn, ...)` orchestrator pattern is replaced with `scorecard(...)`.
2026-05-22	`brandon-behring/eval-toolkit`	#71	`audit.reader_value_bindings` — validate detector→value bindings in reader-prose Markdown (catches the v1.3.2 P1-1 Act 0 bug class where canonical 0.974 was bound to TF-IDF instead of LoRA)	`scripts/audit_value_bindings.py` (v1.3.8) wraps `eval_toolkit.audit_value_bindings.validate_reader_value_bindings` with the consumer’s `BINDINGS` dict + `DETECTOR_ALIASES` + `METRIC_ALIASES` regex maps.	RESOLVED in eval-toolkit v1.0.3; consumed at v1.3.8. Filed at v1.3.3; PR merged into upstream main; v1.0.3 release shipped 2026-05-26 02:35Z as second member of the audit-validator family (after #73 / audit_citation_alignment at v1.0.1).
2026-05-22	`brandon-behring/eval-toolkit`	#72	`audit.sister_doc_concept_drift` — detect cross-doc semantic drift on shared concept tokens (catches the v1.3.2 P1-2 T1-tier-meaning bug class where one doc said T1 = full cloud rerun and the linked sister said T1 = laptop smoke)	No existing script in this project; gap surfaced by Codex external audit + Claude proactive Phase-2 sweep.	filed at v1.3.3; triaged 2026-05-24 (P3, `enhancement,tracked`, #3 of 3 in audit/ subpackage drop per maintainer sequencing); awaiting upstream PR after #74 + #71 land
2026-05-22	`brandon-behring/eval-toolkit`	#73	`audit.adr_citation_alignment` — validate that “per ADR-NNN” citations match the cited ADR’s actual subject (catches the v1.3.2 P1-2 ADR-029-cited-for-tier-lock + ADR-039-cited-for-cost bug class)	`scripts/audit_citation_alignment.py` (v1.3.7) wraps `eval_toolkit.audit_citation_alignment.validate_citations` with the consumer’s `CATEGORY_KEYWORDS` map + builds `ADRSubject` records from `decisions/ADR-*.md` frontmatter.	RESOLVED in eval-toolkit v1.0.1; consumed at v1.3.7. Filed at v1.3.3; PR #74 merged 2026-05-25T22:51:15Z (restructured pre-merge to flat module per upstream ADR 0001: `src/eval_toolkit/audit_citation_alignment.py` not subpackage); shipped in v1.0.1 release ~22:55Z; v1.0.2 release ~23:30Z added #76 cleanup batch (same module blob).
2026-05-26	`brandon-behring/eval-toolkit`	#82	`audit_citation_alignment` — multi-ADR-citation context-awareness (R15 follow-on; Layer 2 scope + Layer 3 pairing extension to the citation-alignment validator). 188 warnings on consumer repo at `tree/v1.3.13` HEAD with `eval-toolkit==1.3.0`. Three patterns: (α) dense multi-ADR-citation list within one paragraph; (β) spec/table per-row ADR citations; (γ) multi-claim sentence with ADR per-clause. Same architectural class as v1.3.0 Layer 3 pairing rules just resolved for `audit_value_bindings` via #81.	`scripts/audit_citation_alignment.py` SKIP_PATTERNS mirrored at v1.3.13 (added `SUBMISSION.md`, `_codex.md`, `AUDIT_CLAUDE_`, `draft.md`, `draft_review.md` for parallel discipline with audit_value_bindings). No further consumer-side workaround applied.	filed at v1.3.13 (2026-05-26); sixth library-first cycle (R15-equivalent following R11→R14 closure for audit_value_bindings). Proposed approach: Path A (`scope='narrative'` mirror of v1.1.0) + Path B (pairing rules mirror of v1.3.0). Acceptance criterion: 188 → ≤20 warnings on consumer HEAD. Consumer-side BUNDLED HARD-gate promotion (audit_value_bindings + audit_citation_alignment together per v1.3.8 plan) deferred to v1.3.14+ pending upstream resolution. Awaiting upstream triage.
2026-05-26	`brandon-behring/eval-toolkit`	#81	`audit_value_bindings` — cross-detector list-grammar follow-on (v1.3.0+ deferred work per upstream ADR 0005 + Round 13 ledger). Three patterns: (1) “X for the frozen probe and Y for TF-IDF” `for X` list connective; (2) “A’s metric is X against B’s Y” possessive comparative; (3) “in-pool X → cross-family Y for the trained detectors” sentence-shift contrast. After v1.2.0 T1-T4 context filters resolved 32 of 36 residual FPs, the remaining 4 are pure cross-detector list-grammar that the v1.2.0 positional heuristic can’t disambiguate.	`scripts/audit_value_bindings.py` migrated at v1.3.11 from 2-tuple to BindingKey schema + SLICE_ALIASES + `scope='narrative'`; pin bumped v1.1.0→v1.2.0 at v1.3.12 (T1-T4 context filters); pin bumped v1.2.0→v1.3.0 at v1.3.13 (Layer 3 pairing rules; no script changes — Tier-1 ADDITIVE). Filed against `README.md:71`, `RESULTS.md:171` (×2), `WRITEUP_PAPER.md:304`.	RESOLVED in eval-toolkit v1.3.0 (2026-05-26T22:27Z; ~51min compressed cycle); consumed at v1.3.13. Upstream shipped Layer 3 pairing rules (4 patterns A/B/C/D per ADR 0006) closing all 4 consumer residuals + emerging Pattern D during upstream dogfood. Consumer dogfood at v1.3.13: 4→0 warnings (100% reduction); closes R11→R14 cycle on audit_value_bindings. Upstream Round 14 ledger explicitly invites HARD-gate promotion; consumer side defers to v1.3.14+ pending #82 (audit_citation_alignment context-awareness) for the bundled-HARD plan per v1.3.8 CHANGELOG.
2026-05-26	`brandon-behring/eval-toolkit`	#80	`audit_value_bindings` — extend `BINDINGS` schema with slice axis (3-tuple `(detector, metric, slice)` keys) so same-detector + same-metric across different slices don’t cross-flag. Architectural follow-on to #71: the current 2-tuple schema cannot distinguish direct-validation AUPRC from pooled-OOD AUPRC, point estimates from paired deltas, or detector values from random-floor values — produces ~95% false-positive rate on the consumer repo. F1 (TF-IDF/LoRA value-binding swap in `WRITEUP_PAPER.md:545`) was correctly flagged at v1.3.8 but hid in 96 warnings of noise. v1.3.9 ships the F1 fix; the 95+ other false-positive warnings persist until upstream lands.	`scripts/audit_value_bindings.py` migrated at v1.3.11 from 2-tuple `(detector, metric)` schema to upstream-recommended `BindingKey` structured-key schema per ADR 0005; `BINDINGS` expanded from 2 entries to 15 covering 5 detectors × 2 metrics × 4 slices (direct_validation, pooled_ood, jbb, xstest); added `SLICE_ALIASES` mapping + `scope='narrative'` content-type filter. v1.3.12 follow-on: pin bump v1.1.0→v1.2.0 (Tier-1 ADDITIVE; T1-T4 context-aware narrative filters activate automatically via existing `scope='narrative'`).	RESOLVED in eval-toolkit v1.1.0 (2026-05-26T20:11Z; ~2h compressed cycle); consumed at v1.3.11. Upstream shipped MORE than proposed: `BindingKey` (frozen dataclass; forward-extensible for future axes per ADR 0005) instead of bare 3-tuple, plus `scope='narrative'` content-type filter (excludes tables + bracketed expressions + code blocks) — the second architectural layer beyond what was filed. Consumer dogfood at v1.3.11: 96→36 warnings (62% reduction on this repo; upstream dogfood reported 76% against v1.3.9 baseline before v1.3.10 additions). Remaining 36 were upstream-acknowledged positional-heuristic limitations (multi-detector list pairings, sub-clause sentence boundaries, “vs” comparisons). v1.2.0 follow-on shipped 2026-05-26T21:20Z (~1h after v1.1.0 closure) addressing 32 of 36 residuals via T1-T4 context-aware narrative filters; consumed at v1.3.12: 36→4 warnings (96% total reduction from 96-baseline). Remaining 4 residuals are pure cross-detector list-grammar (v1.3.0+ upstream parser-level work — see separate filed issue). HARD-gate promotion still deferred per v1.3.8 bundled-promotion plan; awaiting cross-detector list-grammar resolution OR observation window confirmation.
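
The warning counts quoted in the #80 and #81 rows can be cross-checked with a few lines of arithmetic. This is a minimal sketch using only the counts stated in those rows (96, 36, 4, 0); the `reduction_pct` helper is illustrative.

```python
def reduction_pct(before: int, after: int) -> float:
    """Percentage reduction from `before` to `after` warnings."""
    return 100.0 * (before - after) / before


# Counts as reported in the ledger rows for #80 and #81.
baseline, after_v1_1_0, after_v1_2_0, after_v1_3_0 = 96, 36, 4, 0

print(round(reduction_pct(baseline, after_v1_1_0), 1))      # 62.5 (reported as 62%)
print(round(reduction_pct(baseline, after_v1_2_0), 1))      # 95.8 (reported as 96% total)
print(round(reduction_pct(after_v1_2_0, after_v1_3_0), 1))  # 100.0
print(after_v1_1_0 - after_v1_2_0)                          # 32 residuals resolved by T1-T4
```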

How to use this ledger

When you discover a gap during Phase 1+ work:

File the upstream issue with the tracked label (use gh issue create)
Add a row to this ledger with the issue URL + the local file:line that depends on it
If a workaround is unavoidable, leave a TODO comment in code citing the issue number; remove when upstream lands
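
The three steps above can be sketched as follows. This is an illustration under stated assumptions, not project tooling: the helper names, the TODO comment format, and the example title, body, issue number, and local site are all hypothetical. The `--repo`, `--title`, `--body`, and `--label` flags are standard `gh issue create` options. The command is only built here, never run.

```python
import shlex
from datetime import date


def gh_issue_command(repo: str, title: str, body: str) -> list[str]:
    """Build (but do not run) a `gh issue create` invocation with the tracked label."""
    return [
        "gh", "issue", "create",
        "--repo", repo,
        "--title", title,
        "--body", body,
        "--label", "tracked",
    ]


def ledger_row(filed: date, repo: str, issue: int, title: str, local_site: str, status: str) -> str:
    """Format one tab-separated ledger row in the table's column order."""
    return "\t".join([filed.isoformat(), f"`{repo}`", f"#{issue}", title, local_site, status])


def todo_comment(repo: str, issue: int, note: str) -> str:
    """TODO marker (hypothetical format) to leave next to an unavoidable workaround."""
    return f"# TODO({repo}#{issue}): {note}; remove when upstream lands"


# Hypothetical example values, for illustration only.
repo = "brandon-behring/eval-toolkit"
cmd = gh_issue_command(repo, "Add example primitive", "Local site: src/example.py:10")
row = ledger_row(date(2026, 5, 16), repo, 999, "Add example primitive", "`src/example.py:10`", "filed")

print(shlex.join(cmd))
print(row)
print(todo_comment(repo, 999, "hand-rolled stand-in"))
```

Building the argument list separately from running it keeps the filing step reviewable before anything is sent upstream.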

A workaround that hand-rolls a primitive without first filing the issue is an anti-pattern (see SPEC_GREENFIELD.md §7 Anti-patterns).

Test-coverage-gap entries (per ADR-028)

When a coverage gap surfaces under the 70% floor that would be better addressed by an upstream library test (e.g., a test pattern that should live in eval-toolkit’s harness coverage, or a runpod-deploy preflight scenario) rather than absorbed as a low-value local test:

File the upstream issue with the proposed test pattern (sketch, not implementation), the rationale (why upstream is the right home), the local file:line that depends on the absent test, and the tracked label.
Add a row to this ledger with the [test-coverage-gap] prefix in the Title column + the issue URL + the local file:line.

When a gap genuinely cannot be filed upstream (project-specific glue) AND cannot be cheaply tested locally:

Leave a code comment with the rationale (e.g., # noqa: COV — Phase 0-06 deferral; project-specific orchestration glue, see decisions/upstream_issues.md).
Add a row to this ledger with the [not-applicable] prefix in the Title column + the local file:line + the deferral rationale in the Status column.

Both forms preserve the discipline trail without forcing local anti-tests under the 70% floor.

A coverage gap that gets papered over with a no-op test (or with # pragma: no cover and no ledger entry) is an anti-pattern.
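
One way to catch the second form of this anti-pattern is to cross-check coverage exclusions against the ledger. This is a minimal sketch, assuming source files are supplied as a mapping of path to text and that a ledger row covers a file when the file's path appears in its Local site cell. The function name, the matching rule, and the example files are assumptions for illustration.

```python
def unledgered_exclusions(sources: dict[str, str], ledger_local_sites: list[str]) -> list[str]:
    """Return `path:line` for each `# pragma: no cover` whose file no ledger row mentions."""
    findings = []
    for path, text in sorted(sources.items()):
        if any(path in site for site in ledger_local_sites):
            continue  # this file has a ledger entry recording the deferral
        for lineno, line in enumerate(text.splitlines(), start=1):
            if "# pragma: no cover" in line:
                findings.append(f"{path}:{lineno}")
    return findings


# Hypothetical files and ledger entries, for illustration only.
sources = {
    "src/glue.py": "def run():\n    launch()  # pragma: no cover\n",
    "src/orchestrate.py": "def fire():  # pragma: no cover\n    pass\n",
}
ledger_local_sites = ["`src/orchestrate.py:1`"]

print(unledgered_exclusions(sources, ledger_local_sites))  # ['src/glue.py:2']
```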