Library imports — discipline ledger

How to read this page. This is a reference ledger for auditing library-first discipline. It is not a first-read narrative page. Use it when you want to verify that generic evaluation, orchestration, and research primitives came from the load-bearing libraries instead of being reimplemented locally.

This repo uses three load-bearing libraries (see SPEC_GREENFIELD.md §Tech-Stack). Anything implementable as a library primitive is filed upstream (see upstream_issues.md); this ledger lists what is actually imported / invoked from each library. Updated incrementally as code lands.

The ledger is positive evidence: not just “we don’t hand-roll” but “here is exactly what we use from each library.” Reviewer-readable; CI-friendly.

At-a-glance

  • eval-toolkit: metrics, bootstrap CIs, calibration, thresholding, leakage/dedup primitives, and plotting helpers.
  • runpod-deploy: RunPod validation, launch, lifecycle, manifest, cost, and FUSE-workspace operational recipes.
  • research_toolkit: research dossier production skills used before Phase 0 decisions were locked.
  • Project-local scripts remain allowed when they glue project-specific data, paths, or submission audit surfaces around upstream primitives.

Version pinning lock (Phase 0-08 per ADR-036)

Library Pinned version pyproject.toml specifier
eval-toolkit v1.3.0 (range >=1.3.0,<2) eval-toolkit>=1.3.0,<2 (PyPI install per ADR-055; range pin per v1.0 stability-contract; v1.3.13 tightened lower bound to require Layer 3 pairing rules shipped in v1.3.0). Pre-v1.0 trajectory (closures only; per v1.3.14 cell-size compression — full prose lives in CHANGELOG + decisions/upstream_issues.md): v0.31→v0.34 (X8); v0.34→v0.39 (v1.0.6); v0.39→v0.40 PyPI (v1.0.8 #44); v0.40→v0.42 (v1.0.9 #44); v0.42→v0.43 (v1.2.8 ledger); v0.43→v0.44 (v1.2.14 #50+#51); v0.44→v0.47 (v1.2.16 ledger); v0.47→v0.50 (v1.3.5); v0.50→v1.0 (v1.3.6 stability opt-in); v1.0→v1.0.2 (v1.3.7 closes #73); v1.0.2→v1.0.3 (v1.3.8 closes #71). Recent (consumer-noise reductions retained inline): v1.0.3→v1.1.0 at v1.3.11 (BindingKey + scope='narrative' per upstream ADR 0005; closes #80; 62% noise reduction); v1.1.0→v1.2.0 at v1.3.12 (T1-T4 context-aware narrative filters, Tier-1 ADDITIVE per ADR 0005 amendment, follow-on to #80; 96% total reduction from 96-warning baseline = 4 residuals); v1.2.0→v1.3.0 at v1.3.13 (Layer 3 pairing rules per upstream ADR 0006, Tier-1 ADDITIVE; closes #81; 100% total reduction — 0 residuals on audit_value_bindings, closing R11→R14 cycle).
runpod-deploy v0.8.4 runpod-deploy==0.8.4 (PyPI install per ADR-059 narrow supersession of ADR-036; mirrors ADR-055 for eval-toolkit) — moved to [project.optional-dependencies] dev per Phase 4 X3 (validator-flagged 2026-05-17; runpod-deploy is a local orchestrator, pod never imports it). v1.1.0 bump v0.7.7 → v0.8.4 consumed: #88 budget.ssh_ready_timeout_sec (replaces deleted scripts/runpod_deploy_long_ssh.py shim); #90 lifecycle.on_success: recycle (DeBERTa-v3-base ablation single-pod 2-fire); #97 validate --check-image-registry (default in validate --all); #92/#93/#94/#98 upstream resolutions (FUSE workaround docs + Makefile-recipe pattern). v0.8.3 BREAKING removal of stop: schema migrated to lifecycle: in Commit 1 of 3 (BEFORE this pin bump).
research_toolkit v1.9.1 research_toolkit @ git+https://github.com/brandon-behring/research_toolkit@v1.9.1

Pinning strategy (per ADR-036): tag pin + freeze for submission window (Phase 0-08 close → v1.0.0 submission tag per ADR-033); uv.lock provides byte-level reproducibility on top.

Python pin (per ADR-037): requires-python = ">=3.13" + .python-version = 3.13.

Bump triggers — exactly four: 1. Blocking upstream bug that breaks a use-pattern documented below. 2. Critical security fix (CVE-grade) in the upstream. 3. Post-submission reviewer-feedback patch per ADR-033 v1.0.x discipline. 4. Dependency / ledger maintenance bump that records resolved upstream issues without methodology / model / data / compute change (per ADR-066; precedent at v1.2.8: bumped eval-toolkit v0.42→v0.43 to record #48 + #49 + #53 closures as already-resolved-not-consumed; ledger updates only).

Routine “the upstream has a new release” is NOT a bump trigger. Each bump produces a new commit + an entry in decisions/upstream_issues.md referencing the trigger; bumps do NOT supersede ADR-036 (the discipline is locked; specific versions move). Freeze expires at v2.0.0 per ADR-033 major-bump discipline.

Secrets discipline (Phase 0-08 per ADR-035)

All consumer libraries (huggingface_hub, openai, anthropic, runpod-deploy CLI) discover tokens via their default env-var auto-discovery. Token storage uses a three-store split aligned with execution context — local .env (gitignored; per ADR-035) + RunPod pod-secrets via runpod-deploy config + GH Actions repo Secrets. .env.example committed at repo root as a placeholder template enumerating the four canonical env vars (HF_TOKEN + RUNPOD_API_KEY + OPENAI_API_KEY + ANTHROPIC_API_KEY). See ADR-035 for rotation protocol + preflight verification.

eval-toolkit imports (https://github.com/brandon-behring/eval-toolkit)

Primitive Imported in Purpose
eval_toolkit.splits.SourceDisjointKFoldSplitter + eval_toolkit.harness.EvalSlice src/data/splits.py::make_splits (Phase 1 library-first carryforward refactor per ADR-047 Commit 2) LODO source-disjoint k-fold partition (k=4 per ADR-016 Q2 + TRAIN_POSITIVE_SOURCES tuple length); upstream docstring notes “Generalizes the source-disjoint split pattern from prompt-injection-sdd” — abstracted from this project’s predecessor. Project glue remaps upstream-shuffled fold order back to TRAIN_POSITIVE_SOURCES tuple order (deterministic fold_id-to-source mapping); composes per-seed stratified 80/20 train/val + benigns-in-every-train-pool discipline on top
eval_toolkit.text_dedup.near_dedup src/data/dedup.py::dedup_within_source + dedup_cross_source_benigns (ADR-047 Commit 3) Greedy forward-scan near-dedup at threshold 0.80 per ADR-016 Q4. dedup_within_source invokes per-(source, label) cell; dedup_cross_source_benigns invokes on priority-first-concatenated DataFrame (LMSYS-priority tiebreak naturally encoded by forward-scan over priority-first ordering per ADR-016 Q5). Project glue maps upstream’s (dropped_idx, kept_idx, similarity) triples to project-specific dropped_records dicts
eval_toolkit.text_dedup.EmbeddingCosineStrategy(embedder=compute_embeddings) src/data/dedup.py::_embedding_strategy (ADR-047 Commit 3) Strategy class passed to near_dedup + pairs_across; toolkit owns cosine + k-NN, project owns the embedder (compute_embeddings is a MiniLM-L6-v2 sentence-transformer wrapper per ADR-016 Q4)
eval_toolkit.text_dedup.EmbeddingCosineStrategy.pairs_across src/data/dedup.py::drop_train_test_leakage + src/data/audit.py::compute_contamination_scan (ADR-047 Commit 3 + Commit 4) Per-(candidate_train_val) top-1 max-cosine to test set; k=1 returns shape (n_query, 1) similarities + indices. Used in cosine-leakage layer (exact-hash layer stays as set-intersection project-specific layer per ADR-016 Q3). Also used in compute_contamination_scan for per-source max-cosine-to-reference scan per ADR-041 Q6 + A-006
eval_toolkit.text_dedup.EmbeddingCosineStrategy.pairs_within scripts/build_dedup_holdout.py::_enumerate_within_source_pairs (ADR-047 Commit 4) Within-source pair enumeration for the 50-pair dedup-holdout calibration corpus per ADR-041 Q5; k_neighbors=N-1 covers every other row; project glue dedupes ordered (i, j>i) pairs + assigns each to its cosine band {[0.55-0.65), [0.65-0.75), …, [0.95-1.00)}
eval_toolkit.leakage.CrossSplitLeakageCheck + eval_toolkit.leakage.run_leakage_checks + eval_toolkit.harness.EvalSlice src/data/audit.py::compute_leakage_report (ADR-047 Commit 4) Per-(fold, seed) train+val vs test leakage detection via upstream Check Protocol. ADR-047 acceptance criterion specified [ExactDuplicateCheck + NearDuplicateCheck + CrossSplitLeakageCheck]; implementation uses only CrossSplitLeakageCheck since the other two operate within-split and would always report zero findings post-dedup_within_source (which runs upstream in the data pipeline per ADR-041 Q7). Project-dict output schema preserved by extracting test-side drop count from LeakageFinding.drop_indices
eval_toolkit.bootstrap.bootstrap_ci src/eval/marginal_bootstrap.py::compute_marginal_bootstrap_cell (landed at Phase 4 Commit 2 per ADR-046 Q1); scripts/run_marginal_bootstrap.py (landed at Phase 4 Commit 5 orchestrator) Per-rung marginal BCa-bootstrap CI; 10K iterations @ seed=1 headline + 10K @ seed=2 stability check (per ADR-022); validated through MarginalBootstrapCellModel
eval_toolkit.bootstrap.paired_bootstrap_diff scripts/run_bootstrap_battery.py (Phase 3 Commit 5, landed) Rung-vs-rung paired-bootstrap Δ-CI on persisted row-level predictions; full-pairwise persistence per ADR-045 Q6 (~C(rungs, 2) × slices × metrics cells); percentile CI per bootstrap.py:489 (per ADR-022 + ADR-006)
eval_toolkit.bootstrap.paired_bootstrap_ece_diff scripts/run_bootstrap_battery.py (landed; Phase 4) Paired-bootstrap Δ-CI specifically for ECE (per ADR-023 calibration battery + ADR-022 paired-across-rungs)
eval_toolkit.bootstrap.cv_clt_ci src/eval/cross_fold_ci.py::compute_cross_fold_ci_cell (landed at Phase 4 Commit 2 headline + Commit 3 spoke); scripts/run_cv_clt_ci.py (landed at Phase 4 Commit 5 orchestrator) Cross-fold CI via Bayle 2020 Theorem 3.1 on per-fold metric vector (K folds x S seeds-per-fold reduced to K fold means per ADR-022 multi-seed protocol); headline cross-fold CI machinery per ADR-024; validated through CrossFoldCIModel. Each headline cv_clt CI is paired with the upstream block_bootstrap_on_folds spoke consumed at v1.2.2 after #21 closed + auto-flag column per A-008
eval_toolkit.metrics.pr_auc + roc_auc (entry above for Phase 3 Commit 4 landing) scripts/run_metrics_battery.py (Phase 3 Commit 5) Aggregator script orchestrates per-(rung, fold, seed, slice) calls to src/eval/slice_analysis.py::compute_metric_record (no separate recall_at_fpr primitive exists in eval-toolkit; recall@FPR is computed via TargetFPRSelector(t).select(y, s).recall wrapped in src/eval/slice_analysis.py::compute_recall_at_fpr per ADR-021)
eval_toolkit.metrics.expected_calibration_error_equal_mass src/eval/calibration_battery.py (Phase 3 Commit 3, landed) Headline ECE-equal-mass(n_bins=15, quantile binning) per ADR-023
eval_toolkit.metrics.expected_calibration_error + _debiased + _l2 + _l2_debiased src/eval/calibration_battery.py (Phase 3 Commit 3, landed) Full 4-ECE matrix for methodology spoke per ADR-023
eval_toolkit.metrics.brier_score + brier_decomposition src/eval/calibration_battery.py (Phase 3 Commit 3, landed) Headline Brier per rung + spoke decomposition (reliability/resolution/uncertainty) per ADR-023
eval_toolkit.calibration.reliability_curve src/eval/calibration_battery.py (Phase 3 Commit 3, landed) Reliability diagrams per rung (equal-mass quantile binning) for spoke per ADR-023
eval_toolkit.calibration.fit_temperature src/eval/calibration_battery.py (Phase 3 Commit 3, landed) Temperature scaling calibrator fit on val per-(rung, fold, seed) per ADR-023 + ADR-011 Guarantee 6
eval_toolkit.calibration.fit_isotonic_binary src/eval/calibration_battery.py::fit_calibrators_binary (v1.0.9; upstream v0.42.0 per eval-toolkit#44 close) Isotonic regression calibrator with canonical (None, apply) _binary shape; replaces v1.0.8 fit_isotonic_binary_local adapter. 4-of-4 binary calibrator family (temperature + isotonic + Platt + Beta) now on upstream _binary API per ADR-023 + ADR-056.
eval_toolkit.calibration.maximum_calibration_error src/eval/calibration_battery.py (Phase 3 deliverable; audit-only; not yet wired) Worst-bin calibration error dumped to evals/calibration/per_obs_audit.parquet per ADR-023
eval_toolkit.bootstrap.mde_from_ci src/eval/mde.py::mde_from_paired_ci_record + src/eval/mde.py::mde_from_marginal_ci_record; scripts/run_mde.py (landed at Phase 4 Commit 5 orchestrator sweeping ~100 cells) MDE on every reported CI per ADR-006. The generalized upstream API handles paired and marginal CI records after #20 closed in eval-toolkit v0.42.0 and was consumed at v1.2.2.
eval_toolkit.plotting.plot_pr_curve src/eval/figures.py::render_f3_pr_per_rung (landed at Phase 4 Commit 4 per ADR-046 Q6) F3 precision-recall overlay per rung; library-first direct dispatch onto a shared axes via the ax= kwarg
eval_toolkit.plotting.plot_reliability_diagram src/eval/figures.py::render_f4_reliability_triptych (landed at Phase 4 Commit 4 per ADR-046 Q6) F4 reliability triptych — invoked 3x for raw + temperature + isotonic interventions on a 1x3 subplot grid
eval_toolkit.plotting.plot_bootstrap_distribution src/eval/figures.py::render_f7_dual_policy_grid (landed at Phase 4 Commit 4 per ADR-046 Q6) F7 dual-policy operating-point grid sub-panels; one panel per (rung, policy) cell with reachability asterisks per ADR-025 + A-009
eval_toolkit.plotting.plot_lift_ci src/eval/figures.py::render_f6_lodo_breakdown (landed at Phase 4 Commit 4 per ADR-046 Q6); scripts/render_figures.py::render_f2_frozen_vs_lora_paired_delta (ADR-062) CI visualization via library primitive. ADR-062 reuses it for the reviewer-facing paired delta figure rather than hand-rolling CI whiskers
eval_toolkit.plotting.plot_slice_metric_heatmap scripts/render_figures.py::render_f3_slice_grid (ADR-062) Reviewer-facing per-slice AUPRC grid from canonical marginal bootstrap artifacts; project glue overlays N/A labels for single-class slices
eval_toolkit.plotting.save_figure src/eval/figures.py (landed at Phase 4 Commit 4 export contract); scripts/render_figures.py (landed at Phase 4 Commit 5 orchestrator + ADR-062 canonical figure rewrite) Provenance-aware figure persistence per ADR-030 / ADR-062 — writes {stem}.meta.json sidecar carrying figure_id + adr + data_mode + source_artifacts + git commit + timestamp + matplotlib_version for every SVG/PDF/PNG output
eval_toolkit.plotting.set_plot_style + PALETTE src/eval/figures.py (every renderer; landed at Phase 4 Commit 4); scripts/render_figures.py (ADR-062 canonical F1-F5 slate) Consistent styling across reviewer-facing figures; PALETTE’s positive/baseline/accent colors used for canonical bars, target lines, and grouped metric panels
eval_toolkit.thresholds.TargetFPRSelector src/eval/operating_points.py (Phase 3 Commit 4, landed); scripts/fit_dual_policy_thresholds.py (Phase 3 Commit 5) Detection-policy threshold fit on val per-(rung, fold, seed); FPR ≤ 1% target (per ADR-025)
eval_toolkit.thresholds.TargetRecallSelector src/eval/operating_points.py (Phase 3 Commit 4, landed); scripts/fit_dual_policy_thresholds.py (Phase 3 Commit 5) Verification-policy threshold fit on val per-(rung, fold, seed); recall ≥ 99% target (per ADR-025); honest infeasibility reporting via try/except RuntimeError → target_reachable=False per A-009
eval_toolkit.metrics.metrics_at_threshold src/eval/operating_points.py (Phase 3 Commit 4, landed); also src/eval/slice_analysis.py via use_metrics_at_threshold_for_diagnostic wrapper At-threshold metrics dict (recall, fpr, precision, etc.) used by both dual-policy fit + diagnostic dumps
eval_toolkit.metrics.pr_auc + roc_auc src/eval/slice_analysis.py::compute_metric_record (Phase 3 Commit 4, landed) Rank-based descriptive metrics per ADR-006 + ADR-021 + ADR-022
eval_toolkit.bootstrap.paired_bootstrap_op_point_diff scripts/run_bootstrap_battery.py (landed; Phase 4) Two-level bootstrap CI for dual-policy operating-point metrics (refit threshold per val resample, apply on test resample, paired diff); per ADR-025 + ADR-022 per-(seed) threshold protocol
eval_toolkit.metrics.metrics_at_threshold scripts/fit_dual_policy_thresholds.py + src/eval/operating_points.py (Phase 3 deliverable) Compute (precision, recall, FPR, F1) at fitted threshold; per ADR-025 dual-policy reporting layout
Glue: joblib.Parallel(n_jobs=-1) (NOT eval-toolkit; project-specific orchestrator-layer parallelization) scripts/run_bootstrap_battery.py (landed; Phase 4) Parallelize ~10000 independent CI computations across 64-core Threadripper; library-first discipline preserves primitives as single-threaded shipped, parallelism is at call-site (per ADR-022)

Audit primitives (v1.3.7+; ports the audit-script-gap upstream batch #71/#72/#73)

Primitive Imported in Purpose
eval_toolkit.audit_citation_alignment.validate_citations + ADRSubject + CitationMisalignment + extract_adr_subject_category scripts/audit_citation_alignment.py::main (v1.3.7 Action 1; SOFT CI gate; HARD-gate promotion bundled with audit_value_bindings at a future v1.3.X) Validate that “per ADR-NNN” citations in reader-facing markdown match the cited ADR’s actual subject category. Catches the v1.3.2 P1-2 bug class (ADR-029 cited for reproducibility tier-lock where ADR-029 is actually test_markers). Consumer-side glue: builds ADRSubject map from decisions/ADR-*.md frontmatter; supplies consumer’s CATEGORY_KEYWORDS dict (11 categories at v1.3.7 seed). Upstream primitive shipped in v1.0.1 (flat module per upstream ADR 0001; closes eval-toolkit#73).
eval_toolkit.audit_value_bindings.validate_reader_value_bindings + Match + Violation + ValueBindingsReport scripts/audit_value_bindings.py::main (v1.3.8 Action 1; SOFT CI gate; HARD-gate promotion bundled with audit_citation_alignment) Validate that reader-prose (detector, metric, value) triples match the consumer’s canonical bindings table. Catches the V1.3.1 ADR-080 bug class: prose pairing a detector name with the WRONG canonical value (e.g., “TF-IDF reaches 0.974 AUPRC” when canonical TF-IDF AUPRC=0.971 and 0.974 is LoRA’s value). Consumer-side glue: supplies a BINDINGS dict mapping (detector_canonical, metric_canonical) -> expected_value (initial seed at v1.3.8: 2 entries for the V1.3.1 motivating pair; expansion deferred to HARD-gate promotion per /exploring-options Q3) + DETECTOR_ALIASES + METRIC_ALIASES regex maps for surface-form matching. Upstream primitive shipped in v1.0.3 (flat module per upstream ADR 0001; closes eval-toolkit#71).

runpod-deploy imports (https://github.com/brandon-behring/runpod-deploy) [v0.8.4 pinned]

CLI / module Invoked in Purpose
runpod-deploy validate --all Makefile target validate-runpod-config Preflight schema + DC reachability + GPU stock check before any billed run (per ADR-020)
runpod-deploy run --dry-run Makefile target headline-dry-run Cost preview without provisioning — hits runpodctl + GraphQL pricing (per ADR-020)
runpod-deploy run --config ... Makefile target headline-cloud Canonical headline run (per ADR-020)
runpod-deploy logs --config ... live-tail during runs Active-pod log streaming
runpod-deploy stop --state-file ... emergency teardown Cost-cap breach + ADR-013 pre-teardown checklist
runpod-deploy manifest-summary scripts/cost_rollup.py Per-run cost capture from runpod_deploy_pull_manifest.json (per ADR-020 cost-reconciliation recipe)
pod.gpu_order + pod.datacenters schema configs/runpod/headline.yaml 8-class GPU failover × 2-DC failover (per ADR-020)
budget.cost_cap_usd + assumed_hourly_rate_usd configs/runpod/headline.yaml Per-job soft cap $125 (= A-002 upper bound; per ADR-020)
preflight.check_gpu_availability (internal; invoked by validate --all) preflight pipeline Pre-spend GPU-stock check across gpu_order × datacenters cross-product
Recipe: flash-attention-fallback src/training/load_backbone.py Cross-GPU-class portability via try/except (ValueError, ImportError) (per ADR-020). v1.1.2 Phase A renamed from load_modernbert.py per ADR-060 carryforward — same recipe; the generic hf_id kwarg now supports both ModernBERT (ADR-019) and DeBERTa-v3-base (ADR-060).
Recipe: cost-reconciliation scripts/cost_rollup.py Post-run actual-vs-assumed reconciliation via runpod_deploy_pull_manifest.json (per ADR-020 dual-layer cost tracking)
events.emit_event (in flash-attn-fallback recipe) src/training/load_backbone.py fallback branch Audit-trail emission when fallback fires

Quarto + GitHub Actions (introduced by ADR-030 + ADR-033)

Quarto is the writeup-rendering engine introduced at Phase 0-07 close per ADR-030 (deliverable format = repo-only with Quarto-rendered HTML site via GH Actions; supersedes ADR-002 PDF + repo). Listed here to preserve the library-first discipline trail; Quarto version pinning is deferred to Phase 0-08 (library version pinning sub-session).

Library / action Invoked in Purpose
quarto (single-binary CLI; system install) Makefile targets site + site-preview Local Quarto site render (quarto render) + live-reload dev server (quarto preview) per ADR-030
quarto-actions/setup@v2 .github/workflows/publish.yml CI Quarto install (per ADR-030 GH Actions hosting lock)
quarto-actions/publish@v2 .github/workflows/publish.yml Auto-publish rendered _site/ to GH Pages via gh-pages branch on push to main and on tag push v* (per ADR-030 + ADR-033 tag-triggers-publish)
_quarto.yml website config repo root Sidebar nav for 8 spokes + auto-include of all ADRs; format: html only (no PDF auxiliary per ADR-030 Q1.b lock)
index.qmd entry-point repo root Reviewer reading-path guide (A1 + A2 + deep-dive paths per ADR-031)

huggingface_hub publication-side use (introduced by ADR-032)

Beyond ADR-013’s persistence-side use of HF Hub (cache + checkpoint storage), ADR-032 introduces publication-side use — pushing the headline rungs to public BBehring/prompt-injection-<rung> model repos with model card discipline.

Primitive Invoked in Purpose
huggingface_hub.HfApi.create_repo scripts/publish_to_hub.py (v1.0.1; idempotent exist_ok=True) Bootstrap BBehring/prompt-injection-<rung> model repos if absent; safe to re-run
huggingface_hub.HfApi.upload_folder scripts/publish_to_hub.py (v1.0.1) + scripts/generate_model_cards.py (v1.0.1 — model-card generator that publish_to_hub.py uploads) Push trained checkpoint + auto-generated model card README to public HF Hub model repo per ADR-032; allow_patterns filters out optimizer.pt/rng/scheduler training state
huggingface_hub.HfApi.whoami scripts/publish_to_hub.py (v1.0.1) Sanity-check authentication; prints the logged-in HF username before any upload to avoid silent wrong-namespace writes
huggingface_hub.ModelCard scripts/generate_model_cards.py (v1.0.1) Library-first model-card template instantiation; project glue is the per-rung metric block + ADR-032 schema population
huggingface_hub.snapshot_download scripts/eval_from_hub.py (Phase 3 deliverable; T0 reproducibility tier per ADR-034) Download a published checkpoint for eval-only reproduction; pin via revision=<SHA> if drift detected per ADR-034 extension condition. Status (v1.0.1): scaffolded but not invoked — non-dry-run body is a v1.1.x deliverable per ADR-051 (carryforward of ADR-034 T0 score-match wiring)
huggingface_hub.HfApi.list_repos tests/test_invariants.py::test_hf_hub_publication_naming_convention (Phase 5 verification stub; v1.0.x carryforward per ADR-051) Verify naming convention BBehring/prompt-injection-<rung-name> for all published rungs

Phase 3 Evaluation deps (introduced incrementally per ADR-045 across Commits 1–6)

Library First imported in Purpose Pinned at
pydantic src/eval/schemas.py (Commit 2) v2 BaseModel contract for PredictionsRowModel + MetricsRecordModel + SliceMetricsModel + OperatingPointModel + CalibrationRecordModel + ReachabilityAuditModel + BootstrapCellModel per ADR-045 Q7 >=2.5 (pyproject.toml)
anthropic src/scoring/anthropic_judge.py (Commit 2) claude-sonnet-4-6 LLM-judge client per ADR-018 line 58; client.messages.create(temperature=0) >=0.40 (pyproject.toml)
transformers.AutoModelForSequenceClassification + AutoTokenizer src/scoring/protectai.py (Commit 2) ProtectAI v1 + v2 inference loaders per ADR-018 line 76 (DeBERTa-v3-base; head-truncation 512; bf16 GPU) >=4.48 (already pinned for Phase 2)
openai.OpenAI src/scoring/openai_judge.py (Commit 2) gpt-4o-2024-08-06 LLM-judge client per ADR-018 line 58; chat.completions.create(temperature=0, response_format=json_object) >=1.50 (already pinned for Phase 1 ADR-042)

Phase 3 scoring entrypoints:

  • src/scoring/protectai.py::ProtectAIScorer(version, revision) — ProtectAI v1+v2 wrapper; reads HF model+tokenizer at pinned SHA; runs in CI smoke with mock model.
  • src/scoring/llm_judge_base.py::LLMJudgeBase — abstract base with cache infra at evals/audit/llm_judge_cache/<judge>__<text_sha256_first_16>.json per A-007 + A-014.
  • src/scoring/openai_judge.py::OpenAIJudge — gpt-4o-2024-08-06 subclass.
  • src/scoring/anthropic_judge.py::AnthropicJudge — claude-sonnet-4-6 subclass.
  • src/scoring/prompts/prompt_template_v1.md — versioned LLM-judge prompt template per ADR-018 line 67.

Phase 4+ inference deps (v1.1.2 Phase B per ADR-060; inventoried at v1.2.0 per ADR-064)

Primitive Imported in Purpose
transformers.PreTrainedTokenizerBase.__call__(return_overflowing_tokens=True, stride=N, padding='max_length') src/inference/windowed.py::chunk_and_average_inference (v1.1.2 Phase B per ADR-060) HF tokenizer’s native sliding-window protocol — emits 512-token windows with stride 256 (50% overlap) per ADR-060 chunk-and-average truncation strategy. No hand-rolled window-stride math; library-first per the project invariant. Each window is forward-passed through the model; per-window softmax_fp32 (per ADR-019) is averaged to produce final predicted_proba_class1.
src.training.softmax_cast.softmax_fp32 (project-internal; not a library import) src/inference/windowed.py::chunk_and_average_inference + head_truncation_inference (v1.1.2 Phase B) Numerical-stability fp32 cast before softmax per ADR-019, applied per-window during the windowed inference path. Mirrors the existing src/training/train_modernbert.py::_predict_proba usage.
src/inference/windowed.py::predict_with_strategy(model, tokenizer, texts, strategy, window_size, stride, per_device_batch_size) (project-internal) scripts/run_deberta_ood_inference.py (v1.1.2 Phase D OOD inference for ADR-060 ablation) Dispatcher: routes 'chunk_and_average'chunk_and_average_inference or 'head_truncation'head_truncation_inference. Rejects unknown strategy with ValueError per ADR-060 lock + no-silent-failures discipline. Used by the DeBERTa OOD inference orchestrator to score 5 OOD slices per strategy with the matching truncation behavior.

Scope rationale (per ADR-060 + ADR-064 §B / Phase 4+ inference deps section): chunk-and-average inference is a project-specific ModernBERT-vs-DeBERTa-v3 confound-control pattern, NOT a generic eval-toolkit primitive (eval-toolkit’s scope is metrics + calibration + bootstrap; not model-inference strategies). No upstream MR filed against eval-toolkit; src/inference/windowed.py stays project-internal.

Phase 1 Data deps (introduced incrementally per ADR-041 across Commits 1–6)

Library First imported in Purpose Pinned at
huggingface_hub scripts/pin_source_manifest.py (Commit 1) HfApi.dataset_info(repo_id).sha for revision SHA discovery per ADR-041 Q2 >=0.25 (pyproject.toml)
pyyaml (graduated dev → main dep) src/data/manifest_validation.py + scripts/pin_source_manifest.py (Commit 1) Manifest YAML parse/serialize per ADR-041 Q1 rich-schema >=6 (pyproject.toml)
datasets src/data/loaders.py (Commit 2) load_dataset(repo, name=subset, split=split, revision=sha) per ADR-041 Q4 HF dispatch >=3.0 (pyproject.toml)
pandas + pandas-stubs src/data/loaders.py (Commit 2) DataFrame uniform schema (text, label, source, row_idx_in_source); parquet IO at Commit 4 >=2.2 (pyproject.toml) + >=2.2 dev
pyarrow src/data/loaders.py (transitive via pandas; Commit 4 parquet) parquet engine >=17 (pyproject.toml)
sentence-transformers src/data/dedup.py (Commit 3) all-MiniLM-L6-v2 embedder per ADR-016 Q4 + THRESHOLD=0.80 locked constant >=3.0 (pyproject.toml)
numpy src/data/dedup.py + scripts/build_dedup_holdout.py (Commit 3) pairwise cosine matrix ops; default_rng(seed) for deterministic sampling >=2.0 (pyproject.toml)
scikit-learn src/data/splits.py (Commit 4) sklearn.model_selection.train_test_split(stratify=y, random_state=seed) for within-fold 80/20 (per ADR-016 Q2; SEEDS={42, 43, 44} per ADR-006) >=1.5 (pyproject.toml)
torch (transitive via sentence-transformers) src/data/dedup.py (Commit 3) encoder backend; CPU-only inference on laptop; flash-attn fallback in Phase 2+ trainer pinned by sentence-transformers transitive
openai scripts/llm_prelabel_dedup_holdout.py (Commit 3 supplement per ADR-042) gpt-4o-2024-08-06 judge for dedup-pair-near-duplicate bootstrap labeling (same snapshot as ADR-018 headline rater); chat.completions.create(temperature=0, response_format=json_object) >=1.50 (pyproject.toml)

Dedup pipeline entrypoints:

  • scripts/build_dedup_holdout.py — Generates 50 stratified-cosine-band candidate pairs from the 4 train-positive sources; writes data/dedup_holdout.jsonl with true_duplicate: null (TBD — hand-labeled by Brandon per ADR-041 Q5).
  • scripts/calibrate_dedup.py — Reads labeled holdout; writes evals/dedup_calibration.json with FPR + FNR at locked threshold 0.80 + sensitivity table at {0.75, 0.80, 0.85}.

Workflow: build_dedup_holdout → hand-label → calibrate_dedup → unskip test_dedup_calibration_persisted.

Pin script entrypoint: scripts/pin_source_manifest.py (Commit 1) — one-time + bump-driven; live-fetches HF SHAs via huggingface_hub.HfApi.dataset_info + GitHub SHAs via subprocess.run(["git", "ls-remote", url, "HEAD"]); writes data/source_manifest.yaml; idempotent re-runs; SHA-mismatch raises SHAMismatchError unless --force records bump_history entry per ADR-036.

Manifest validator entrypoint: src/data/manifest_validation.py::validate_manifest(path) (Commit 1) — invoked from tests/test_invariants.py::test_source_manifest_schema_valid + scripts/pin_source_manifest.py post-write sanity check.

Audit + leakage + contamination pipeline (Commit 5):

  • src/data/audit.py::compute_data_audit(...) — per-source counts + per-fold class balance + length distribution; operationalizes ADR-016 A-005 triggers 2 + 4.
  • src/data/audit.py::compute_leakage_report(splits) — exact-hash + cosine >= 0.85 train+val vs test overlap per (fold, seed); ADR-016 Q3 hard-locked invariant.
  • src/data/audit.py::compute_contamination_scan(benigns, ood, slate, templates) — per-row max cosine to (slate ∪ templates) reference corpus; ADR-016 A-005 trigger 1 + A-006 + ADR-041 Q6.
  • src/data/templates.py::extract_hackaprompt_templates(spec) — ~200 successful-injection templates from HackAPrompt; balanced across 10 difficulty levels; disjoint sample seed (1337) from slate (42).
  • scripts/run_data_pipeline.py — end-to-end orchestrator (load + dedup + split + leakage-cleanup + materialize + audit + leakage + contamination); writes 3 evals/ JSONs + 36 per-fold parquets + 36 index masks.

ADR-043 post-split leakage cleanup:

  • src/data/dedup.py::drop_train_test_leakage(train_val_df, test_df, threshold=0.85) — scans train+val vs test for exact-hash + cosine ≥0.85 overlaps; drops train-side rows; returns cleaned df + per-pair drop records.
  • src/data/splits.py::apply_leakage_cleanup(splits, threshold=0.85) — applies the above to all 12 (fold, seed) splits; re-partitions cleaned train+val at the 80/20 ratio.
  • Wired between make_splits and materialize_splits in scripts/run_data_pipeline.py. Pipeline log records n_dropped per split (exact + cosine breakdown) for audit.

Audit tooling (project-internal; not a methodology primitive)

Per ADR-065 §B3: audit tooling is a meta-level concern (submission-prep / drift defense), NOT a methodology primitive subject to the strengthened library-first invariant (project-local Claude memory memory/library_first_is_project_wide_invariant.md; not committed to the repo). The 4 scan-pattern categories are project-shaped (per-cell.parquet column names; specific ADR slug formats; HF Hub BBehring/prompt-injection-* URL pattern; project-specific dollar-figure context). Logged here for inventory completeness; explicitly tagged audit-tooling-not-primitive.

Script Imported in Purpose Tag
scripts/audit_writeup_numbers.py (~290 LOC; introduced at v1.2.1 per ADR-065 §B) .github/workflows/audit-writeup.yml (CI hard-gate); on-demand local invocation (uv run python scripts/audit_writeup_numbers.py [--report-only]) Programmatic numeric-claim audit on 12 reviewer-facing markdown surfaces; 4 scan categories (numbers + ADR slugs + version strings + URLs); cross-checks against canonical parquets (evals/cost_ledger.csv sum for cumulative cost; future extensions: per_cell.parquet for AUPRC). Configurable --strict default (CI; exit 1 on drift) + --report-only opt-out (local-dev iteration; always exit 0). Pattern after scripts/audit_leakage.py + scripts/audit_reference_scorers.py. audit-tooling-not-primitive
scripts/audit_leakage.py (introduced at v1.0.6) Local invocation + CI leakage job in .github/workflows/ci.yml Verifies evals/leakage_report.json shows leakage_clean=True per ADR-016 + ADR-039 + ADR-043. Wraps upstream eval_toolkit.leakage.CrossSplitLeakageCheck primitive; this script is the project-internal verifier of the persisted artifact. audit-tooling-wrapping-library-primitive
scripts/audit_reference_scorers.py Local invocation + CI Reference-scorer contamination-tier audit per ADR-005 + ADR-018; verifies the 3-state taxonomy in eval-toolkit manifests is consistent with the project’s reference-scorer slate. audit-tooling-not-primitive
scripts/regenerate_audit.py (lifts from Phase 0; checked in CI via make audit-sync-check) Local invocation + CI; --check mode gates pre-commit hook Regenerates SUBMISSION_AUDIT.md from ADR frontmatter (1 CLAIM row per ADR); strict drift-detection in --check mode. The SUBMISSION_AUDIT is derived; ADRs are source of truth (per decisions/README.md lifecycle). audit-tooling-derived-artifact

Scope rationale: project-internal status preserves library-first compliance for methodology primitives (which DO belong in eval-toolkit / runpod-deploy / research_toolkit per the strengthened invariant) while accepting that submission-prep audit tooling is a one-off concern with project-shaped scan patterns. Future portfolio repo work (per memory/portfolio_plan_approved.md) may upstream a generic markdown-drift-scanner primitive if reuse warrants — out of scope for this submission.

research_toolkit usage (https://github.com/brandon-behring/research_toolkit)

The literature dossier at docs/research/ was produced by this toolkit’s skill pipeline. New dossier work invokes the same skills:

Skill / artifact Used in Purpose
/research-plan docs/research//research_plan.md sub-area planning + claim_family taxonomy
/research-gather docs/research//bib_ledger.yml verified primary sources via WebSearch + WebFetch
/dossier-build docs/research//dossier/ topic tables; one row per entry
/agent-index docs/research// 5-bullet-per-entry synthesis + AGENT-INDEX
/dossier-audit docs/research//audit_trail.md DROP/CORRECT/FLAG decisions
/url-freshness-check docs/research//url_check_report.md URL HEAD-check status