Project specification (filled at end of Phase 0)

How to read this page. This is a historical spec-lock sheet. It is useful for auditing what Phase 0 locked and how later phases propagated those decisions. It is not the best first read for the project story; start with README.md, RESULTS.md, or WRITEUP.md instead.

Current state: locked historical spec sheet. The current live site is in the v1.2.x patch series; later patches may clarify presentation, render quality, or status drift without changing the original Phase 0 methodology contract. ADRs remain the source of truth for decision changes.

Type: Single-version SDD spec; revisions tracked through Michael Nygard ADRs.

Companion docs

Current-state orientation

  • The spec was locked before implementation; it intentionally preserves some Phase 0 language and historical status conventions.
  • Later v1.x patches may narrow, supersede, or clarify individual decisions via ADRs and changelog entries. Read the relevant ADR when current and historical wording differ.
  • The page is organized as an audit artifact, not a narrative. The main reader path summarizes the project in plain language before linking here.
Historical Phase-0 context retained for auditability

This submission targets the morning of 2026-05-18 (≈ 2.5 working days from Phase 0-00 start on 2026-05-15), with Long-scope ambition refined by Phase 0-01 + Phase 0-03 + Phase 0-04 + Phase 0-05 + Phase 0-06 + Phase 0-07 + Phase 0-08 (and post-Phase-0-08 final audit per ADR-040 surfacing 7 backfilled assumptions A-010 through A-016) (4-rung trained slate — TF-IDF + LR classical floor per ADR-017 + ModernBERT-base × {frozen-probe, LoRA, full-FT} per ADR-015 — plus 2 reference rungs at their published native configs — protectai/deberta-v3-base-prompt-injection (v1) + protectai/deberta-v3-base-prompt-injection-v2 per ADR-018 (superseded by ADR-050 R1; LLM judges gpt-4o-2024-08-06 + claude-sonnet-4-6 dropped Phase 4 on cost re-estimation per ADR-050 R1) partially supersedes ADR-015 reference slate (Lakera dropped, ProtectAI v1 added) — with 3-seed multi-seed protocol per ADR-006 floor formalized per ADR-022 paired-across-rungs implementation, full OOD slate aggregated per ADR-021 (pooled headline + per-slice spoke), paired-bootstrap apparatus per ADR-006 + ADR-022 with cross-fold CI via eval-toolkit cv_clt_ci (Bayle 2020) headline + block-bootstrap-on-folds spoke ablation per ADR-024, and calibration battery via raw + temperature + isotonic interventions per ADR-023) leveraging runpod-deploy 0.7.7 + eval-toolkit library infrastructure (per ADR-020 — 8-class GPU failover + dual-DC + adaptive batch + dual-layer cost cap; per ADR-022 joblib parallelization on 64-core Threadripper at orchestrator layer), and an explicit fallback ladder updated per ADR-015 (1×3 → 1×2 → 1×1 for transformer rungs; TF-IDF+LR classical floor retained across all fallbacks per ADR-017) that activates if mid-Phase-2 surfaces infeasibility (per ADR-001). The single-backbone refinement eliminates the per-backbone-truncation confound on the indirect-injection zero-shot OOD slice that the original 2-backbone framing would have produced (per ADR-014 Q3/Q4 walk). The full 5-rung OOD slate (2 trained + 2 reference + 1 classical) + 4-rung LODO ladder is stratified along ADR-005’s three-state contamination taxonomy (per ADR-018 + ADR-050 R1) — TF-IDF+LR verified_disjoint anchor + transformer rungs backbone-partial-disjoint + ProtectAI v1/v2 suspected_contamination (vendor_black_box tier empty per ADR-050 R1; 3-tier gradient compressed from the original 4) — making contamination disclosure a methodology axis rather than a footnote. Total: 48 trained runs (4 rungs × 3 seeds × 4 LODO folds; TF-IDF+LR runs are sklearn CPU, transformer runs are H100/equivalent bf16 with per-epoch prediction save) plus 100 prediction parquet files (84 trained + 16 reference) feeding cv_clt_ci on 12 per-(fold, seed) values per rung plus per-row paired-bootstrap on pooled rows. The deliverable is a public GitHub repo rendered as a Quarto-built static HTML site auto-published to GitHub Pages via a quarto-actions/publish@v2 workflow (per ADR-030 supersedes ADR-002 — PDF removed; pandoc + LaTeX dependencies dropped). The site uses an index.qmd entry-point + Quarto sidebar nav declared in _quarto.yml to surface 8 spokes + decisions/ ADRs to a dual A1+A2 audience (hiring manager + ML researcher; per ADR-031 supersedes ADR-004 — A1+A2 + B4 + hub-and-spoke survive; hub artefact shifts from PDF to Quarto site). The submission is governed by three project-level methodology principles (ADR-005): methodology over metrics, honest evaluation preferred even when models look worse, and structured limitations with extension conditions.

  • Locked methodology defaults: process discipline + validated content patterns are [LOCKED] generically; project-specific instantiation details (datasets, rungs, hyperparams, OOD slate, budget) are [OPEN] for Phase 0.
  • Resolved at Phase 0: see decisions/ for ADRs locked during the spec interview.
  • Open at start of Phase 0: see SPEC_GREENFIELD ledger appendix for the ~50 [OPEN] decisions resolved during the interview.

This is an exploration spec for an SDD-disciplined iteration — not a production system, not a paper, not a publishable benchmark. The work is methodology + capability characterization braided: characterize what each capability layer adds, using an evaluation methodology rigorous enough to detect real differences and quantify uncertainty.


1. Goal & non-goals

Goal: deliver a methodology-disciplined characterisation of what successive capability layers (classical TF-IDF + LR floor → frozen ModernBERT-base linear probe → LoRA adapters → full fine-tune) add to prompt-injection detection across a 4-source LODO IID test slate + a 5-slice OOD slate (BIPIA + InjecAgent + JBB-Behaviors + XSTest + NotInject), with bootstrap CIs + paired-bootstrap rung-vs-rung + calibration battery + dual-policy threshold characterisation per ADRs 005-046, published as a Quarto static site + HF Hub model cards + v1.0.0 GitHub release. No rung promoted as a winner; honest unflattering results retained.

Non-goals: - Not optimizing for SOTA PR-AUC. - Not building a deployable service. Deployment is not on the roadmap. - Not creating a publishable benchmark. - [LOCKED: per ADR-005 + ADR-017] Not picking a leader rung — each rung’s trade-offs are characterized, no rung is promoted as the deployment recommendation. The rung-ladder IS the Pareto frontier (per ADR-005 methodology-over-metrics + ADR-017 trained-rung-slate-as-Pareto-instrument framing). - [LOCKED-via-omission] No additional non-goals surfaced during Phase 0-00 through Phase 0-08; the three above + the rung-recommendation non-goal cover the project scope.

Scope authority: the spec itself is the scope cap. Anything not specified here is out of scope. Adding scope post-spec-freeze requires an ADR with explicit “Why this is in scope now” justification.


2. Phases & process gates

the project work is structured into six phases. Each phase has a gate checklist of work-completed and tests-passing — not metric thresholds. The intent is to make movement between phases auditable, not to bind the project’s narrative to specific numerical outcomes.

Phase 0: Spec lock-in interview [LOCKED]

Gate: decisions/ directory contains an ADR for every Phase-0-resolved decision; transcripts/ directory contains the interview transcript.

Phase 1: Data

Gate: every checkbox ticked; tests/test_data.py and tests/test_leakage.py green.

Phase 2: Training

Gate: every checkbox ticked; training manifests schema-validated.

Phase 3: Evaluation

Gate: every checkbox ticked; evals/results.json parses cleanly.

Phase 4: Analysis

Gate: every checkbox ticked; analysis JSON outputs match schemas.

Phase 5: Writeup

Gate: every checkbox ticked; reviewer URLs (source pin at tree/v1.0.0 + live Quarto site + GH release page) all resolve; transcripts ready for private email attachment.


3. Data design

3.1 Train pool composition

[LOCKED: Path α — full source slate (per ADR-016)] — 4 positive sources + 2 benign sources + 5 OOD slices. HarmBench + Tensor Trust + LLMail-Inject deferred to afterword.

Source Approx N (post-dedup, capped) Role License LODO fold
deepset/prompt-injections ~500-650 (use all) Train pos Apache-2.0 1
Lakera/gandalf_ignore_instructions ~800-1000 (use all) Train pos MIT 2
Lakera/mosscap_prompt_injection 3000 (cap) Train pos MIT 3
hackaprompt/hackaprompt-dataset 3000 (cap) Train pos per dataset card 4
lmsys/lmsys-chat-1m 10000 (cap; English-only filter) Train neg CC-BY-4.0 (stratified across folds)
HuggingFaceH4/ultrachat_200k 10000 (cap) Train neg Apache-2.0 (stratified across folds)
leolee99/NotInject 339 OOD hard-neg (over-defense) MIT (never trained)
paul-rottger/xstest 450 OOD hard-neg (over-refusal) per repo (never trained)
JailbreakBench/JBB-Behaviors 200 (100 harmful + 100 benign) OOD mixed MIT (never trained)
microsoft/BIPIA per-task OOD indirect (zero-shot per ADR-014) per repo (never trained)
uiuc-kang-lab/InjecAgent 1054 OOD agentic (stretch probe) per repo (never trained)

Benign subsample ceilings per source: [LOCKED: 3K positives per source for mosscap+HackAPrompt; use-all for deepset+Lakera-gandalf post-dedup; 10K benigns per source for LMSYS+UltraChat; random subsample at seed=42 (per ADR-016)]. Class balance per LODO training pool ≈ 1:2 to 1:2.7 (positives:benigns). Quality-filtered HackAPrompt + attack-type-stratified + length-stratified subsamples deferred to afterword.

3.2 Splits

[LOCKED: LODO k=4 over positive sources + 3 seeds per LODO fold; no internal k-fold (per ADR-016)]. Source-disjoint Leave-One-Dataset-Out at outer level (4 folds, one held-out positive source per fold) + 3 random-initialization seeds = 12 observations per rung. With the 4-rung trained slate locked by ADR-017 + ADR-019 (TF-IDF+LR + ModernBERT × {frozen-probe, LoRA, full-FT}), this is 48 trained runs total (4 rungs × 3 seeds × 4 LODO folds); 12 are sklearn CPU runs (TF-IDF+LR), 36 are H100/equivalent bf16 transformer runs with per-epoch prediction save per ADR-019 (72 transformer prediction files + 12 TF-IDF+LR prediction files = 84 trained-rung prediction parquets). Within each LODO fold: single 80/20 train/val random split (no nested k-fold); val used for threshold selection + calibration fitting + early-stopping per ADR-011 Guarantee 6 (NOT used for hyperparameter tuning per SPEC §2 hyperparameter-immutability). Per-rung bootstrap CIs from 12 observations (10K bootstrap iterations, BCa marginal per ADR-006); rung-vs-rung paired-bootstrap uses (LODO-fold × seed) pairing; MDE on Δ-AUROC ≈ 0.03. Stratified k-fold within LODO (Fomin 2025 / Nadeau-Bengio 2003 variance decomposition; ~5x compute) deferred to afterword.

3.3 Dedup, leakage prevention, cross-source label conflicts

  • Semantic dedup: [LOCKED: sentence-transformers/all-MiniLM-L6-v2 cosine at threshold 0.80; simplified calibration via FPR+FNR on 50-pair labeled holdout persisted to evals/dedup_calibration.json (per ADR-016)]. Label-aware (within (source, label) cells); deterministic first-occurrence retention; cross-label minimal pairs preserved per SPEC_GREENFIELD lock. MPNet-base-v2 + full 4-gate selection rule + cross-encoder reranker deferred to afterword.
  • Cross-source minimal pairs: [LOCKED] preserve-and-flag.
  • Cross-source benign dedup ordering: [LOCKED: within-source-first → cross-source (LMSYS-priority tiebreak) → LODO split (per ADR-016)]. Pipeline: within-source dedup pass per source → cross-source dedup pass (LMSYS-priority on cross-source near-duplicates because LMSYS is real-user data; UltraChat is synthetic) → split into LODO folds with benign stratification.
  • Leakage invariants: tests/test_leakage.py asserts no exact-hash and no high-cosine train-test overlap.
  • Reference-scorer training-overlap audit: [LOCKED] see WRITEUP §3.3 + EVIDENCE.md §1–2.

Truncation policy for inputs > length cap: [LOCKED: adaptive-chunked-max-pool stride=cap//2 at eval time; head-truncation at training time (per ADR-014)]. Training-positives are short so head-truncation rarely bites at train time (HF tokenizer default truncation_side="right"). At eval time, inputs exceeding the cap are split into overlapping chunks of size cap with stride cap // 2 (50 percent overlap so no token sits at a chunk boundary in both chunks); each chunk is scored independently; per-sample score is the max over chunk scores (max-pool aggregation — matches adversarial threat model). Under ADR-015 single-backbone refinement (ModernBERT-base at 8K native), adaptive chunked rarely activates (only on samples exceeding 8K tokens — about 5 percent of BIPIA per dossier estimate). Reference rungs run at their published native configurations including their native truncation policies (ProtectAI head-truncation at 512; Lakera as-API; LLM-judges receive full sample). Mandatory chunked-vs-head ablation on the BIPIA slice lives in WRITEUP/truncation-ablation.md. Phase 1 validation checkpoint: if BIPIA outlier-rate above 8K exceeds 15 percent of the slice, a superseding ADR-016 adjusts chunk-stride or aggregation policy.

3.4 OOD slate

[LOCKED: 5 OOD slices (per ADR-016) reported in two aggregation views (per ADR-021)] — direct over-defense + over-refusal + mixed-direct + indirect zero-shot + agentic-stretch. HarmBench + Tensor Trust + LLMail-Inject deferred to afterword as named next-iteration extensions.

Aggregation layout (per ADR-021): PDF executive headline table carries a single pooled-OOD column per rung (concatenated rows across the 5 slices, single AUPRC + AUROC + recall@FPR + ECE + Brier per rung). Methodology spoke at WRITEUP/ood-analysis.md (new file) carries the 5-by-rung per-slice grid with per-slice bootstrap CIs computed on the same persisted predictions via paired-bootstrap apparatus per ADR-006 + ADR-022 — no extra compute beyond additional metric calls. Pooled-and-per-slice reporting applies ADR-004 hub-and-spoke framing to OOD: pooled for A1 (hiring manager exec scan); per-slice for A2 (ML researcher generalization-question-by-question read). Aligns with Demsar 2006 JMLR multi-dataset reporting guidance.

Slice Source Role Why
NotInject leolee99/NotInject Hard-negative (benign-with-injection-triggers) Tests over-defense per InjecGuard 2024 methodology; explicitly invites worse-but-honest evaluation per ADR-005 Principle 2
XSTest paul-rottger/xstest Hard-negative (over-refusal) Tests exaggerated-safety patterns per Röttger 2024 NAACL
JBB-Behaviors JailbreakBench/JBB-Behaviors Mixed (100 harmful + 100 benign) Standardized misuse-behavior evaluation per Chao 2024 NeurIPS D&B
BIPIA microsoft/BIPIA Indirect (zero-shot OOD per ADR-014 Q1) Indirect-injection benchmark per Yi 2023 KDD; the load-bearing zero-shot transfer measurement
InjecAgent uiuc-kang-lab/InjecAgent Agentic (stretch probe) Tool-integrated agent injection per Zhan 2024 ACL; agentic transfer-of-transfer caveat per ADR-010 Bound 2

Linked ADRs: ADR-014 (threat-model bundle — attack-class scope), ADR-015 (rung architecture — 3 ModernBERT-base trained + 4 reference rungs), ADR-016 (this — data design bundle), ADR-008 (data scope brief-level locks — preserved), ADR-041 (Phase 1 implementation bundle — manifest rich-schema + live-fetch SHA pinning + manifest_validation.py placement + loader dispatch + stratified-cosine-band dedup holdout + slate-plus-templates contamination corpus + per-fold parquet materialization).

3.5 Phase 1 implementation status

[Phase 1 closed per ADR-041] Operationalisation of §3.1–3.4 locks; all 6 commits green. Per-commit status:

Phase 1 commit Deliverable Invariant test Status
Commit 1 configs/data/source_manifest.yaml (live-fetched SHAs; rich schema; bump_history=[]; relocated from data/ per ADR-044 Q2) + src/data/manifest_validation.py + scripts/pin_source_manifest.py test_source_manifest_schema_valid green
Commit 2 src/data/loaders.py (HF dispatch + 11 normalizers) + tests/smoke/test_loaders_smoke.py (3 small HF sources) smoke tests green (3 smoke + dispatch unit)
Commit 3 src/data/dedup.py + scripts/build_dedup_holdout.py + scripts/calibrate_dedup.py + 4 smoke tests; preliminary evals/dedup_calibration.json via ADR-042 LLM-pre-label bootstrap (gpt-4o-2024-08-06; full 4-source coverage; FPR=0.00 FNR=0.33 at locked 0.80; FPR jumps to 0.063 at 0.75 — 0.80 lock at the precision-recall knee) test_dedup_calibration_persisted green green; human_verified_pct=0 pending Brandon’s hand-examination per ADR-042
Commit 4 src/data/splits.py (LODO k=4 x 3 seeds x stratified 80/20) + materialize_splits + materialize_index_masks + 9 smoke tests test_class_balance_per_fold + test_source_disjoint_train_test (unskip in Commit 5 with real data) green
Commit 5 src/data/audit.py + src/data/templates.py + scripts/extract_hackaprompt_templates.py + scripts/run_data_pipeline.py end-to-end orchestrator + ADR-043 post-split leakage cleanup; evals/{data_audit,leakage_report,contamination_scan}.json materialized (4707 deduped positives + 17246 deduped benigns + 1101 OOD; 180 leaked train rows dropped via ADR-043; A-005 triggers 1+2 clean; leakage_clean=True) test_benign_contamination_scan_clean + test_class_balance_per_fold + test_source_disjoint_train_test all green green (5 invariants total)
Commit 6 Makefile Phase 1 targets (data-pin-manifest, data-prepare umbrella, data-fetch/data-dedup/data-splits/data-audit ADR-041-Q7-compat aliases, data-templates, data-dedup-{holdout,prelabel,calibrate}) + docs/ROADMAP.md Phase 1 close note + SUBMISSION_AUDIT regen + transcript checkpoint + push n/a green

3.5.1 Phase 1 library-first carryforward refactor (per ADR-047)

[Phase 1 carryforward refactor closed per ADR-047 at Commit 4 2026-05-16] Triggered by Phase 4 entry walkthrough Q6 user reaffirmation of the library-first invariant as project-wide; retroactive audit identified 4 hand-rolls in src/data/ where eval-toolkit ships fitting primitives. Two upstream contributions filed at audit close: issue #18 (wire 50-pair golden dedup-holdout into eval-toolkit CI fixtures); issue #19 (3-pattern cookbook docs). Each refactor commit deletes orphaned local helpers in-commit per the no-orphaned-code discipline (saved as memory 2026-05-16).

Refactor commit Deliverable Invariants verified Status
Commit 1 (ADR-047 setup) ADR-047 + SPEC_SHEET §3.5.1 + upstream issues #18 + #19 filed + decisions/upstream_issues.md ledger updated + SUBMISSION_AUDIT regen n/a green
Commit 2 (splits refactor) src/data/splits.py::make_splits consumes eval_toolkit.splits.SourceDisjointKFoldSplitter; project glue maps upstream-shuffled fold order back to TRAIN_POSITIVE_SOURCES tuple order (deterministic fold_id-to-source mapping preserved across refactor); per-seed stratified 80/20 train/val + benigns-in-every-train-pool preserved 9 splits smoke tests + 5 invariants (test_class_balance_per_fold + test_source_disjoint_train_test + …) all pass green
Commit 3 (dedup refactor) src/data/dedup.py::{dedup_within_source, drop_train_test_leakage, dedup_cross_source_benigns} consume eval_toolkit.text_dedup.{near_dedup, EmbeddingCosineStrategy(embedder=compute_embeddings), EmbeddingCosineStrategy.pairs_across}; _greedy_first_occurrence_mask deleted in-commit (no remaining callers); pairwise_cosines retained pending Commit 4 (still has callers in audit.py + build_dedup_holdout.py + test); project-owned embedder glue (get_encoder + compute_embeddings + encoder_revision_sha) preserved; compute_embeddings signature broadened from list[str] to Sequence[str] for upstream Callable[[Sequence[str]], ndarray] Protocol compat (non-breaking — all callers pass list) 4 dedup smoke tests pass (including test_dedup_cross_source_lmsys_priority priority-source reason preservation); 123/123 smoke total + 10 invariants pass; mypy + ruff green green
Commit 4 (audit refactor + close) src/data/audit.py::compute_leakage_report consumes run_leakage_checks([CrossSplitLeakageCheck]) per fold (ExactDuplicateCheck + NearDuplicateCheck dropped per implementation note — they would always report zero findings post-dedup_within_source); compute_contamination_scan consumes EmbeddingCosineStrategy.pairs_across(query, reference, k=1) + project per-source aggregation glue; project-dict output schemas preserved for both. _per_row_max_cosine_to_ref (audit.py local helper) deleted in-commit. pairwise_cosines (dedup.py) deleted in-commit (now truly orphaned after audit.py + build_dedup_holdout.py refactors away from it). test_pairwise_cosines_symmetric (tested deleted primitive) deleted in-commit. scripts/build_dedup_holdout.py::_enumerate_within_source_pairs refactored to use EmbeddingCosineStrategy.pairs_within(texts, n-1) so the script’s pairwise_cosines import dependency is severed. Output schema for evals/leakage_report.json preserved (CrossSplitLeakageCheck count maps to existing cosine_ge_085_overlaps field) — no schema migration needed 6 audit+dedup smoke tests pass (test_compute_data_audit_yields_per_source_counts + test_compute_leakage_report_zero_overlaps_on_disjoint_splits + test_compute_contamination_scan_unrelated_benigns_clean + test_compute_embeddings_shape_and_norm + test_dedup_within_source_drops_near_duplicates + test_dedup_cross_source_lmsys_priority); 122/122 smoke total (was 123; -1 from deleted test_pairwise_cosines_symmetric) + 10 invariants pass; mypy + ruff green green

Phase 1 library-first carryforward refactor CLOSED at Commit 4. ADR-046 (Phase 4 implementation bundle per prior 7-question ratification) writing unblocked; Phase 4 Commit 1 begins after ADR-046 lands.

3.6 Phase 2 implementation status

[Phase 2 closed per ADR-044] Operationalisation of §4 locks; all 6 commits green. Per-commit status:

Phase 2 commit Deliverable Invariant test Status
Commit 1 ADR-044 (Phase 2 implementation bundle; partial supersession of ADR-019 seed slate (42,1337,2025)→(42,43,44)) + manifest move data/configs/data/ per Q2 + 10-file path-ref update test_source_manifest_schema_valid (still green at new path) green
Commit 2 src/training/{batch_table, lora_config, training_args, weighted_trainer, load_modernbert, softmax_cast}.py per ADR-019 + ADR-020 + 18 smoke tests test_flash_attn_fallback_present + test_effective_batch_constant_across_gpu_classes green green (7 invariants total)
Commit 3 src/training/{tfidf_lr, train_classical}.py per ADR-017 + configs/rungs/classical_floor.yaml + scripts/train_classical_floor.py + 5 smoke tests test_classical_floor_rung_present green green (8 invariants total)
Commit 4 src/training/train_modernbert.py multi-rung HF Trainer dispatch (frozen_probe + lora + full_ft via classifier_type) + configs/rungs/{frozen_probe, lora, full_ft}.yaml (ModernBERT-base SHA pinned at 8949b909) + PerEpochPredictionsCallback per ADR-019 + 10 smoke tests test_per_epoch_predictions_present (deferred to canonical run; needs GPU) green (8 invariants total; per-epoch invariant deferred)
Commit 5 configs/runpod/headline-{frozen_probe, lora, full_ft}.yaml (runpod-deploy schema_version 2 — H100/H200/A100/L40S failover; cost caps $40/$60/$100) + scripts/train_rung.py per-rung sweep + scripts/cost_rollup.py aggregator + 8 smoke tests n/a (cloud runs at canonical) green (code lands; runs deferred to canonical)
Commit 6 tests/fixtures/processed/fold-0/seed-42/*.parquet (100/24/24 rows; 12KB total; reproducible via scripts/generate_fixtures.py at seed=1337) + configs/profiles/classical_fixtures.yaml + tests/smoke/test_smoke_pipeline.py (3 tests; fixture-pipeline + idempotency) + Makefile Phase 2 targets (generate-fixtures, train-classical-floor, train-rung RUNG=<...>, cost-rollup, cost-rollup-check, headline-{frozen-probe,lora,full-ft}) + make smoke extended to fixture-pipeline pass per ADR-027 line 75 + docs/ROADMAP.md Phase 2 close note n/a green

3.7 Phase 3 implementation status

[Phase 3 closed per ADR-045] Operationalisation of §5 locks; all 6 commits green. Per-commit status:

Phase 3 commit Deliverable Invariant test Status
Commit 1 ADR-045 (Phase 3 implementation bundle; scoring-first contract + 6-commit cadence + tiered ref-scorers + classical-scaffold + full-pairwise persistence with headline-only WRITEUP + pydantic schema validation) + SPEC_SHEET §3.7 status table + SUBMISSION_AUDIT regen n/a green
Commit 2 src/scoring/{protectai, llm_judge_base, openai_judge, anthropic_judge}.py per ADR-018 + src/eval/schemas.py (pydantic models — PredictionsRowModel, MetricsRecordModel, SliceMetricsModel, OperatingPointModel, CalibrationRecordModel, ReachabilityAuditModel, BootstrapCellModel) + versioned prompt template at src/scoring/prompts/prompt_template_v1.md + Tier-A (ProtectAI) CI smoke + Tier-B (LLM judges) cache infrastructure at evals/audit/llm_judge_cache/<judge>__<sha256-prefix>.json per A-007 + A-014 + 22 smoke tests test_reference_scorer_schema_uniform green green (9 invariants total)
Commit 3 src/eval/calibration_battery.py per ADR-023 (eval-toolkit ECE 4-variant matrix expected_calibration_error{,_debiased,_l2,_l2_debiased} + expected_calibration_error_equal_mass headline at n_bins=15 + brier_score + brier_decomposition reliability/resolution/uncertainty + fit_temperature + fit_isotonic_calibrator + reliability_curve; validation-only fit per ADR-011 Guarantee 6; proba_to_logprobs + apply_temperature helpers for binary-to-2-col-logit conversion) + 12 smoke tests test_calibration_battery_outputs_4ece_plus_brier green green (10 invariants total)
Commit 4 src/eval/operating_points.py per ADR-025 (TargetFPRSelector(0.01) detection + TargetRecallSelector(0.99) verification per-(rung, fold, seed) val fit; fit_operating_point + fit_dual_policy_for_cell + compute_reachability_audit per A-009) + src/eval/slice_analysis.py per ADR-021 (5-slice OOD slate compute_metric_record + pooled-headline compute_pooled_ood_record + per-slice spoke aggregate_slice_across_observations + 0.1% pinpoint volatility surfaces compute_pinpoint_volatility per ADR-021 line 53-65) + 20 smoke tests module-level smoke tests cover contract (test_dual_policy_threshold_pairing + test_verification_reachability_audit + test_ood_aggregation_layout + test_recall_at_fpr_pinpoint_volatility are integration-level invariants deferred to Commit 5 when scripts wire end-to-end) green (10 invariants total; 4 stubs deferred to Commit 5)
Commit 5 scripts/run_metrics_battery.py (loads predictions parquets per rung × fold × seed × slice; emits MetricsRecordModel + pooled-OOD records via src/eval/slice_analysis.py) + scripts/fit_dual_policy_thresholds.py (sweeps trained-rung × fold × seed; reference scorers filtered via TRAINED_RUNGS allowlist per SPEC §4; emits OperatingPointModel + ReachabilityAuditModel nested-JSON per A-009) + scripts/run_bootstrap_battery.py (full-pairwise C(rungs, 2) × slices × metrics via eval_toolkit.bootstrap.paired_bootstrap_diff; persists BootstrapCellModel per Q6 user refinement so post-hoc questions answer from disk; WRITEUP features the 3 headline comparisons) + scripts/eval_from_hub.py T0-tier dry-run surface per ADR-034 (full body gated on Phase 5 ADR-032 publication) + 5 subprocess-based smoke tests covering all 4 entrypoints smoke covers contract; integration invariants (test_dual_policy_threshold_pairing + test_verification_reachability_audit + test_ood_aggregation_layout + test_recall_at_fpr_pinpoint_volatility + test_bootstrap_n_and_stability_check + test_paired_across_rungs_pairing + test_cross_fold_ci_methodology) remain skip-marked pending Phase 4 canonical evals run on full 84-parquet trained-rung output green (10 invariants total; 7 integration stubs deferred to Phase 4)
Commit 6 Makefile Phase 3 targets (eval-classical-floor, eval-reference-scorers-free Tier-A scaffold, eval-reference-scorers-paid Tier-B with interactive approval per ADR-045 Q4, metrics-battery, dual-policy-thresholds, bootstrap-battery, eval-from-hub Phase-3-wired) + make smoke extension (now includes run_metrics_battery.py end-to-end pass on classical-floor fixture predictions + eval_from_hub.py --dry-run per ADR-027 sub-10-min budget) + tests/fixtures/metrics/ gitignored + docs/ROADMAP.md Phase 3 close note + Phase 4 unblock n/a green

3.8 Phase 4 implementation status

[Phase 4 closed per ADR-046] Operationalisation of §5 plus ADR-006 + ADR-022 + ADR-024 + ADR-025 (plus partial supersession of ROADMAP TBD-at-Phase-4 reference-scorer-audit-deferred framing per ADR-046 Q5 user override → include-now-locked); all 6 commits green. Reference-scorer slate further narrowed by ADR-050 at Phase 4-5 transition (LLM judges dropped on cost; full-FT OOD dropped on FUSE crash). Per-commit status:

Phase 4 commit Deliverable Invariant test Status
Commit 1 ADR-046 (Phase 4 implementation bundle; 6-commit cadence + scaffold-with-classical + always-emit-both-CIs auto-flag + MDE-on-every-emitted-CI + LLM-rater audit included per user override + library-first hybrid figures per project-wide invariant codification + Phase 5 prep deferred) + SPEC_SHEET §3.8 status table + SUBMISSION_AUDIT regen n/a green
Commit 2 src/eval/marginal_bootstrap.py per ADR-022 (bootstrap_ci wrappers; 10K @ seed=1 headline + 10K @ seed=2 stability check) + src/eval/cross_fold_ci.py (cv_clt_ci headline per ADR-024; block-bootstrap spoke fields scaffolded as None pending Commit 3) + src/eval/mde.py per ADR-006 (mde_from_paired_ci_record direct wrap + mde_from_marginal_ci_record closed-form workaround per upstream issue #20) + 3 new pydantic schemas (MarginalBootstrapCellModel + CrossFoldCIModel + MDECellModel) + 18 smoke tests test_marginal_bootstrap_seed_stability + test_cv_clt_ci_headline_present (deferred-unskip at canonical evals run; 44 total tests collect cleanly) green
Commit 3 src/eval/cross_fold_ci.py extension — compute_block_bootstrap_on_folds (inline NumPy workaround per upstream issue #21; vectorized resample of K folds with replacement; percentile CI per ADR-022) + compute_a_008_flag (strict > 1.5 per A_008_RATIO_THRESHOLD; degenerate-cv_clt edge case handled) + compute_cross_fold_ci_cell always populates both cv_clt + block fields + the boolean flag; 10 new smoke tests cover halfwidth ordering + seed determinism + flag rule + threshold constant test_block_bootstrap_folds_spoke_present + test_a_008_flag_fired_when_ratio_exceeds_1_5 (deferred-unskip at canonical evals run; 46 total tests collect) green
Commit 4 src/eval/figures.py per Q6 — library-first hybrid 7-figure slate; consumes eval_toolkit.plotting.{plot_pr_curve, plot_reliability_diagram, plot_bootstrap_distribution, plot_lift_ci, save_figure, set_plot_style, PALETTE} for F3 + F4 + F6-right + F7-subpanels + project glue for F1 Pareto + F2 ROC + F5 heatmap + F6-left + F7 grid layout (cites upstream issues #14 + #15 + #16 + new #22 plot_metric_bars ax kwarg for F6-left as TODOs); SVG output via save_figure writes a {stem}.meta.json sidecar carrying provenance per ADR-030; 14 smoke tests pass headless via matplotlib.use("Agg"); matplotlib graduated to main deps from notebook extras test_figures_slate_7_svgs_present + test_save_figure_provenance_chunks_present (deferred-unskip when Commit 5 orchestrates the canonical slate; 48 total tests collect) green
Commit 5 Orchestration scripts — scripts/run_marginal_bootstrap.py per Q4 (sweeps marginal cells x both seeds per ADR-022) + scripts/run_cv_clt_ci.py per Q3 (sweeps both cv_clt + block fields + a_008 flag) + scripts/run_mde.py per Q4 (aggregates MDE across paired + marginal + cv_clt + block cells via closed-form path; emits evals/audit/mde_per_cell.parquet) + scripts/render_figures.py per Q6 (canonical + scaffold paths; emits docs/plots/F{1..7}.svg + per-figure .meta.json provenance sidecars) + scripts/audit_reference_scorers.py per Q5 user override (samples disagreement pairs vs trained rung, interactive approval gate per ADR-020 + --dry-run cost preview + --assume-yes for CI; uses OpenAIJudge from Phase 3 Commit 2 with locked OPENAI_JUDGE_MODEL per ADR-018) + 5 subprocess-based smoke tests smoke covers contract; canonical-data invariants deferred to operator-gated runs green
Commit 6 Makefile Phase 4 targets (marginal-bootstrap, cv-clt-ci, mde-battery, render-figures, audit-reference-scorers, phase4-all umbrella) + extended make smoke (now also runs scripts/render_figures.py --scaffold writing 7 SVG + sidecars to tests/fixtures/plots/; under ADR-027 sub-10-min budget) + tests/fixtures/plots/ gitignored + docs/ROADMAP.md Phase 4 status + close note + 6-step v0.9.0-rc1 rehearsal-tag dispatch checklist per ADR-033 + Phase 5 (Writeup) unblock n/a green

After Commit 6 lands + invariants pass, v0.9.0-rc1 rehearsal tag fires triggering the full publish pipeline (Quarto site build per ADR-030 + GH Pages deploy + HF Hub model card pushes per ADR-032) as a 24+ hour dress-rehearsal per ADR-033 + ADR-038. Phase 5 (Writeup) begins after.


4. Model recipe (locked, no gridsearch)

Each rung is locked before training begins. No val-set hyperparameter gridsearch.

4.1 Rung 1 — classical floor (TF-IDF + LR)

[LOCKED: sklearn TF-IDF + LogisticRegression (per ADR-017)] — Combined sparse features via FeatureUnion: word 1-2-grams (max_features=15000, sublinear_tf=True, lowercase=True, strip_accents=unicode) + char 3-5-grams (max_features=15000); concatenated → up to 30K-dim sparse matrix. Classifier: LogisticRegression(solver='liblinear', C=1.0, class_weight='balanced', max_iter=1000) — fit-to-convergence; no epoch concept; deterministic per seed (ADR-006 slate: 42, 1337, 2025). 3 seeds × 4 LODO folds = 12 sklearn CPU runs. Contamination state: verified_disjoint (trained on our LODO splits by construction).

4.2 Rung 2 — frozen-features probe

[LOCKED: ModernBERT-base frozen-probe (per ADR-015 + ADR-019)] — Transformer body frozen; linear classifier head (2-class) trained on [CLS]-pooled embeddings via WeightedTrainer subclass (CrossEntropyLoss with per-fold sklearn class_weight='balanced' tensor; per ADR-019). bf16=True with fp32 cast before final softmax. 2 epochs; cosine LR schedule with 10% warmup; lr=1e-4. Per-epoch checkpoint + per-epoch parquet predictions persisted. Dual role per ADR-017: candidate detector in headline table AND diagnostic anchor in methodology spoke. Contamination state: backbone-partial-disjoint (fine-tuning disjoint by LODO; backbone pretrain corpus may overlap eval sources).

4.3 Rung 3 — LoRA adapter-fine-tuned

[LOCKED: ModernBERT-base LoRA (per ADR-015 + ADR-019)] — PEFT-LoRA adapters; backbone frozen; classifier head full-FT via modules_to_save=["classifier"]. Locked recipe (per ADR-019): LoraConfig(r=8, lora_alpha=16, lora_dropout=0.1, target_modules=["Wqkv", "attn.Wo", "mlp.Wo", "mlp.Wi"], task_type="SEQ_CLS", bias="none") — explicit module enumeration (4 LoRA modules per encoder × 22 layers = 88 adapter modules), not "all-linear" auto-detection. TrainingArguments: lr=1e-4, warmup_ratio=0.10, lr_scheduler_type=cosine, per_device_train_batch_size=16 + gradient_accumulation_steps=2 (effective batch 32; ADR-020 BATCH_TABLE scales for non-H100 classes), num_train_epochs=2, bf16=True, max_grad_norm=1.0, weight_decay=0.01, save_strategy=“epoch”, eval_strategy=“no”. DataCollatorWithPadding(max_length=8192, pad_to_multiple_of=8) — dynamic padding, head-truncation per ADR-014 Q4 training-time. Per-fold sklearn class_weight='balanced' via WeightedTrainer. Contamination state: backbone-partial-disjoint.

4.4 Rung 4 — full-FT trained backbone

[LOCKED: ModernBERT-base full-FT (per ADR-015 + ADR-019)] — Full backbone parameters trainable; standard HF Trainer + eval-toolkit metric callbacks + WeightedTrainer subclass for class-weighted CE. Same recipe as Rung 3 (lr=1e-4, 2 epochs, bf16, effective batch 32, etc.). Intermediate (epoch-1) weight checkpoints not persisted to disk (~1.8 GB throwaway across 12 runs); per-row predictions for epoch-1 are saved without the underlying weights since predictions are the audit-relevant artifact. Final epoch checkpoint is persisted per ADR-013 pre-teardown checklist. Contamination state: backbone-partial-disjoint.

4.5 Reference rungs — 2 published baselines at native config (post-ADR-050 narrowing)

[LOCKED: 2 reference rungs (per ADR-018 → ADR-050 narrow supersession; LLM judges dropped on Phase 4 cost re-estimation, ~16× envelope overrun)] — Lakera Guard dropped at Phase 0-03 per ADR-018 (afterword extension). LLM judges (gpt-4o-2024-08-06 + claude-sonnet-4-6) dropped at Phase 4 per ADR-050. The vendor_black_box contamination tier therefore carries 0 rungs in this submission; the contamination-stratification gradient compresses from 4 tiers to 3 (verified_disjoint + backbone-partial-disjoint + suspected_contamination). ProtectAI v1 + v2 + TF-IDF+LR remain as the 3-rung reference slate.

  • R-ProtectAI-v1: protectai/deberta-v3-base-prompt-injection (HF revision SHA-pinned at Phase 1 per ADR-016 manifest); inference-only at native config (head-truncation at 512); bf16 on GPU. Contamination state: suspected_contamination.
  • R-ProtectAI-v2: protectai/deberta-v3-base-prompt-injection-v2 (HF revision SHA-pinned at Phase 1); inference-only at native config (head-truncation at 512); bf16 on GPU. Contamination state: suspected_contamination.

Each reference rung is called at its published native configuration including its native truncation policy. Apples-to-apples comparison against deployed baselines requires testing them as they exist, not as preprocessed by us. Training-data overlap audit per EVIDENCE.md §1-2. The methodology spoke includes a dedicated Contamination stratification subsection narrating the three-tier disclosure gradient (verified_disjointbackbone-partial-disjointsuspected_contamination); the trained-rung-vs-reference comparison is framed as “what trained-from-scratch (TF-IDF+LR verified_disjoint anchor) achieves versus what potentially-memorized off-the-shelf models achieve.”

LODO comparison: 3-rung trained ladder (frozen-probe + LoRA + full-FT) retained per ADR-050 Revision 2 (full-FT LODO predictions survived Phase 2). OOD comparison: 2-rung trained (frozen-probe + LoRA) + classical floor (tfidf-lr) + 2 reference scorers (ProtectAI v1 + v2) = 5-rung OOD slate. full-FT OOD inference dropped at X11 FUSE EIO crash per ADR-050.

4.6 Per-epoch prediction-save discipline

[LOCKED: epoch-2 headline, epoch-1 diagnostic (per ADR-019)] — Per-row predictions persisted for every transformer (rung, seed, fold, epoch) combination → 72 transformer prediction parquets + 12 TF-IDF+LR (no-epoch) + 16 reference rungs = 100 total prediction files. File-path convention: evals/predictions/<rung>__fold<F>__seed<S>__epoch<N>.parquet. Discipline rule pre-committed: epoch-2 predictions are the publication number; epoch-1 predictions are reported as a diagnostic ablation in the methodology spoke (the per-(rung, seed, fold) epoch-1→epoch-2 AUPRC delta plot surfaces undertraining-vs-overfitting boundaries).

4.7 Matched-budget controls

[LOCKED: per-axis (per ADR-018)] — Match data (same train/eval splits per ADR-016) + eval methodology (same metrics, same statistical machinery per ADR-006); do NOT match training compute. Each rung uses its natural recipe; training compute is reported alongside the metric so AUPRC-vs-compute can be plotted as a Pareto frontier — the rung-ladder IS the Pareto frontier. Per-axis matching is the only framing that coherently handles the heterogeneous cost classes (LLM-judge $/call, trained rungs GPU-minutes, ProtectAI inference-only). Documented as a dedicated Matched-budget framing subsection in the methodology spoke.

4.8 Compute infrastructure (per ADR-020)

[LOCKED: runpod-deploy 0.7.7 with 8-class GPU failover, dual-DC, adaptive batch, dual-layer cost cap] — - pod.gpu_order (priority): H100 80GB HBM3 → H100 NVL → H100 SXM → H100 PCIe → H200 → H200 NVL → A100-SXM4-80GB → A100 80GB PCIe → L40S → A100-SXM4-40GB (emergency) - pod.datacenters: [US-MD-1, EU-RO-1] (dual-DC failover) - BATCH_TABLE (preserves effective batch = 32 across GPU classes): H100/H200/A100-80G use (per_device=16, grad_accum=2); A100-40G/L40S use (8, 4); L40 uses (4, 8). Pre-locked lookup keyed on torch.cuda.get_device_name; fail-loud on unlisted GPU. - flash_attention_2 fallback per runpod-deploy recipe: try/except (ValueError, ImportError) around model load → degrades to stock SDPA on smaller classes; events.emit_event("flash_attn_fallback", ...) for audit. - Cost cap (dual-layer): per-job budget.cost_cap_usd=125.0 (orchestrator-enforced; = A-002 upper-bound soft cap) + project-wide hard cap $200 enforced by scripts/cost_rollup.py CI-gated check aggregating across all per-pod runpod_deploy_pull_manifest.json files + API call logs. - assumed_hourly_rate_usd=3.50 (H100 spot midpoint; reconciled post-first-run per cost-reconciliation recipe). - Preflight discipline: runpod-deploy validate --all + runpod-deploy run --dry-run before any billed run. - Cost tracking (dual-layer): per-pod automatic via runpod_deploy_pull_manifest.json + per-Makefile-target rollup in evals/cost_ledger.csv (cols: timestamp, target, est_cost_usd, actual_cost_usd, gpu_hours, api_calls, notes).

4.9 Future-work extensions (afterword)

[LOCKED: NONE in primary slate; future-work extensions named per ADR-015 + ADR-017 + ADR-018 + ADR-019 alternatives] — ModernBERT-large size-up, matched-context cross-backbone control, alternate classification head (MLP), calibration via validation-fit temperature, Lakera Guard re-addition (ToS-permitting), frontier-tier judge ablation (gpt-4.1 / opus-4-7), reasoning-judge ablation (o1/o3), multi-judge ensemble, rank ablation (r=4/r=16/r=32), target-module ablation (Q+V vs all-linear), DoRA / rs-LoRA / VeRA, 1-epoch-locked schedule comparison, 3-epoch convergence study, focal loss vs class-weighting, per-source learning-curve decomposition, hashing vectorizer for long docs, calibrated LR via CalibratedClassifierCV. Calibration is a separate methodology axis (Phase 0-04 walks the calibration battery, ledger row 343).

Linked ADRs: ADR-015 + ADR-017 + ADR-018 + ADR-019 + ADR-020 (compute + cost discipline).


5. Eval design

5.1 Primary descriptive metrics

[LOCKED: PR-AUC + ROC-AUC + recall@FPR={0.1pct-pooled-only, 1pct, 5pct} + ECE-equal-mass(n_bins=15, quantile) + Brier on raw scores per rung (per ADR-021 + ADR-023)]. All reported with bootstrap CIs per ADR-022 + ADR-024 (cv_clt_ci on 12 (fold, seed) per-rung values for rank-based metrics; pool-rows-and-compute-once for per-row metrics; 10K @ seed=1 + 10K @ seed=2 stability check; >5% half-width flag).

Dual-policy operating-point columns (per ADR-025) — trained rungs only — gain one new headline column “FPR @ recall ≥ 99%” (verification policy operating point via TargetRecallSelector(0.99) on val); the existing R@FPR=1% column carries a footnote tagging it as the detection policy operating point via TargetFPRSelector(0.01) on val. Headline footprint per trained rung settles at: AUPRC | AUROC | R@FPR=0.1%* | R@FPR=1%† | R@FPR=5% | FPR@R≥99%† | ECE | Brier (* = ADR-021 0.1%-pooled-only volatility flag; † = dual-policy operating points). Reference rungs receive blank cells in the verification column with footnote pointing to the SPEC §4 dual-policy applicability lock (only trained rungs get dual-policy framing; reference scorers report recall@FPR pinpoints only with contamination caveats per ADR-018).

Recall@FPR pinpoint volatility surfacing (per ADR-021) — for the 0.1% pinpoint at pooled level: half-width column alongside point estimate; flag marker when half-width > 0.5 × point estimate; resample-degeneracy fraction emitted to evals/audit/per_rung_audit.json; per-resample threshold-drift dump to evals/audit/pinpoint_threshold_drift.json; methodology spoke explains why 0.1% reports wider CIs and is not computable per-slice. The 0.1% pinpoint is reported only at the pooled aggregation level (pooled n_neg ≈ 16-20K yielding 16-20 FPs at threshold); at per-slice or per-LODO-fold aggregation it is reported as “not computable at this aggregation level (n_neg too small)”.

Calibration battery composition (per ADR-023) — Headline: ECE-equal-mass(n_bins=15, quantile binning) + Brier on raw scores per rung. Spoke (WRITEUP/calibration.md): all 4 ECE variants from eval-toolkit (L1/L2 × plug-in/debiased) + Brier decomposition (refinement / reliability / uncertainty) + reliability diagrams (equal-mass quantile) + intervention deltas — temperature scaling (Guo 2017 1-parameter) + isotonic regression (non-parametric monotonic remapping); both calibrators fit on validation only per-(rung, fold, seed) per ADR-011 Guarantee 6; calibration interventions are monotonic and therefore do NOT change rank-based headline metrics (PR-AUC, ROC-AUC, recall@FPR).

5.2 Statistical tests

Stance: report effect sizes and CIs only. No p-values. The work characterises differences and their uncertainty rather than claiming significance.

Anchored to eval-toolkit primitives:

  • bootstrap_ci — per-metric finite-sample uncertainty. See eval-toolkit bootstrap methodology (see README).
  • paired_bootstrap_diff — paired comparisons across rungs on the same test set. See eval-toolkit comparison methodology (see README).
  • mde_from_ci — minimum detectable effect.
  • Calibration battery (reliability_curve, fit_temperature, fit_isotonic_calibrator, ECE variants, Brier). See eval-toolkit calibration methodology (see README).
  • cv_clt_ci — CLT-based CI for cross-fold variance.

Cross-fold CI methodology: [LOCKED: cv_clt_ci (Bayle 2020 Annals of Statistics Theorem 3.1 implementation at eval-toolkit src/eval_toolkit/bootstrap.py:963) headline + block-bootstrap-on-folds spoke ablation + conditional stratified-k-fold-within-LODO escalation if Phase 4 compute budget permits (per ADR-024)]. cv_clt_ci operates on the 12 per-(fold, seed) metric values yielded by ADR-022’s compute-per-(fold, seed)-then-aggregate rule for rank-based metrics. Block-bootstrap-on-folds spoke ablation directly addresses the LODO non-exchangeability concern (folds are not exchangeable — each fold holds out a different positive source with different size and attack-style character). Sensitivity-check flag: if block_bootstrap_CI_halfwidth / cv_clt_CI_halfwidth > 1.5 for any rung, methodology spoke names “LODO non-exchangeability dominates within-fold variance; headline CI may understate uncertainty” (assumption A-008). Bates 2024 JASA nested-CV + Nadeau-Bengio 2003 standalone correction explicitly deferred to afterword.

Paired-test method: [LOCKED: eval-toolkit paired_bootstrap_diff (Efron-Tibshirani 1993 §10.3 row-level pairing) per ADR-022; DeLong + McNemar + Cochran-Q rejected at the row level with multi-source-LODO-specific rationale (DeLong's asymptotic Gaussian assumption breaks at per-fold n ≈ 4-5K benigns; designed for AUROC only; produces p-value contradicting estimation-over-testing; LODO fold-blocking violates iid assumption)].

Multi-seed protocol (per ADR-022 + ADR-006 + ADR-016): [LOCKED: 3 seeds {42, 1337, 2025} paired across rungs; trained rungs 12 obs per rung (4 LODO folds × 3 seeds); reference rungs 4 obs per rung (4 folds × no seed dimension); trained-vs-trained pairing is row-level via paired_bootstrap_diff; trained-vs-reference pairing replicates reference scores across the 12 trained seeds (reference-side variance fold-only); rank-based metrics per-(fold, seed)-then-mean; per-row metrics pool rows across (fold, seed); recall@FPR thresholds per-(seed) from val; calibration interventions per-(rung, fold, seed); per-(rung, fold, seed) observations dumped to evals/audit/per_seed_observations.parquet per ADR-011 Guarantee 5].

Multi-comparison correction (per ADR-022 + ADR-006): [LOCKED: no formal correction applied; methodology spoke at WRITEUP/methodology.md gains "Family of comparisons" acknowledgment paragraph citing Gelman & Loken 2014 forking-paths + ASA 2016 statement on p-values]. Estimation-over-testing means correction does not apply (correction applies to significance-testing; we report effect sizes).

5.3 Operating points — detection vs verification

[LOCKED] Dual-policy framing on in-house rungs only. Reference scorers (off-the-shelf reference detectors) get recall@FPR pinpoints with explicit contamination caveats; no dual-policy framing (would imply deployment-ready operating points that don’t survive the contamination caveat).

[LOCKED: Detection — FPR ≤ 1% via eval_toolkit.TargetFPRSelector(0.01); Verification — FNR ≤ 1% (equivalently recall ≥ 99%) via eval_toolkit.TargetRecallSelector(0.99); per-(rung, fold, seed) fitting on validation only; 24 thresholds per trained rung × 4 trained rungs = 96 threshold-pair instances; paired_bootstrap_op_point_diff two-level bootstrap (refit per resample) for CI propagation; cost-weighted thresholding remains rejected per ADR-006 (no CostSensitiveSelector use); per ADR-025].

Headline integration: detection-policy operating point coincides numerically with the recall@FPR=1% headline pinpoint already locked in ADR-021 — captured as a footnote on the existing R@FPR=1% column. Verification-policy operating point gains one new headline column “FPR @ recall ≥ 99%” per trained rung (see §5.1).

Spoke: full dual-policy operating-point grid (4 trained rungs × 2 policies × {pooled-IID + pooled-OOD + 4 per-LODO-fold + 5 per-OOD-slice} aggregation levels = 80 cells per policy with paired_bootstrap_op_point_diff CIs) + Verification-target reachability across trained rungs subsection (per assumption A-009; honest infeasibility reporting via asterisk + audit JSON evals/audit/verification_reachability.json) + ≥3 deployment scenarios per ADR-006 + optional Recall-floor sensitivity sweep afterword regenerating verification operating points at recall floors {95%, 99%, 99.9%} from persisted predictions per ADR-013 (zero new training compute) — all in WRITEUP/threshold-policy.md. See eval-toolkit thresholds methodology (see README) for the eval-toolkit primitive surface.

5.4 Per-source and per-style breakdowns

Required for any OOD claim — aggregate metrics hide heterogeneity. Reported alongside the headline IID/OOD numbers. Per-style heuristic tagger (regex-based) is conservative; LLM-as-rater rubric audit dropped per ADR-050 (Phase 4 cost re-estimation showed envelope ~16× original ADR-018 estimate; see EVIDENCE.md §3).

5.5 Adversarial robustness

Largely deferred — named but not exhaustively probed. The threat model (paraphrase, encoded payloads, multi-turn injection, base64/leetspeak obfuscation) is named per ADR-014; what was not tested is named explicitly in WRITEUP §5.6 and §8.

Linked ADRs: ADR-021 (eval slate aggregation + recall@FPR pinpoints), ADR-022 (statistical inference apparatus — bootstrap N + multi-comparison + multi-seed + paired-test), ADR-023 (calibration battery — raw + temperature + isotonic), ADR-024 (cross-fold CI methodology — cv_clt_ci headline + block-bootstrap-on-folds spoke), ADR-025 (dual-policy threshold characterization — symmetric 1% targets + per-(rung, fold, seed) fitting + verification-reachability audit).


6. Code architecture

The work spans three repos:

  • prompt-injection-detection-prototype (this repo) — modelling: data loading, training, classification API, project-specific scoring code.
  • eval-toolkit — evaluation harness: metrics, bootstrap, calibration, threshold selection, leakage detection, slice-aware orchestration, reproducibility manifests, versioned JSON schemas.
  • runpod-deploy — cloud orchestration for training/eval runs on rented GPUs. the project’s additions: prediction-persistence pull-pattern + checkpoint upload-to-HF-Hub pattern.

The split is intentional: methodology curriculum and primitives live in eval-toolkit so they survive across iterations; cloud orchestration lives in runpod-deploy so it’s reusable across projects.

6.1 Module layout (per ADR-026)

[LOCKED: concern-grouped sub-packages under src/]

src/
  data/        # loaders, dedup, LODO splits, manifest validation
  training/    # ModernBERT loader, LoRA configurator, trainer
  scoring/     # reference-scorer adapters (one module per scorer)
  eval/        # calibration_battery, operating_points, slice_analysis
  utils/       # config_hash, paths, logging glue
scripts/       # CLI entrypoints — argparse + IO; orchestrate src/ calls
configs/
  runpod/      # canonical RunPod config per ADR-020
  rungs/       # per-rung YAML hyperparameters per SPEC §5 config discipline
  profiles/    # smoke vs canonical profile configs per ADR-027
  data/        # source manifest with HF SHAs per ADR-016
tests/
  conftest.py  # marker registration + shared fixtures
  test_invariants.py  # 25+ tests-as-invariants per SPEC §5
  fixtures/    # smoke-test fixture data (NOT real data)
  unit/        # pytest -m unit
  smoke/       # pytest -m smoke
  integration/ # pytest -m integration

Boundaries — src/ is library code (importable, no side effects); scripts/ is entrypoint glue (argparse + IO; not importable); configs/ is YAML data; tests/ is verification. Adding or moving a top-level src/ sub-package requires a superseding ADR.

6.2 Smoke vs canonical separation (per ADR-027)

[LOCKED: three Makefile targets stratified by execution context]

Target Execution context Compute Network Wall-clock Purpose
make smoke laptop only no GPU no network <10 min dev debugging + reviewer “does this wire together” check
make test-integration local GPU OR cloud pod GPU when available; skip gracefully when not optional ~5-10 min dev debugging on workstation GPU; pre-flight smoke on cloud pod
make headline-cloud RunPod (billed) H100/equivalent per ADR-020 gpu_order failover required hours; cost-cap-gated $125/job per ADR-020 + A-002 canonical evaluation deliverable — not a test

Honest framing (required in WRITEUP/methodology.md): math-correctness validation lives upstream in eval-toolkit (≥90% coverage floor, Hypothesis property tests, golden-output snapshots, doctests on math kernels). The local test layer in this prototype repo is debugging-grade — sufficient to catch glue-layer breakage before paying for cloud compute, not sufficient to substitute for upstream library validation. Reviewers consult eval-toolkit’s test suite for math-correctness evidence.

A separate make headline-dry-run target exposes runpod-deploy run --dry-run standalone for cost preview without provisioning.

6.3 Linked ADRs

ADR-026 (module layout), ADR-027 (smoke vs canonical), ADR-028 (coverage floor), ADR-029 (test marker strategy).


7. Verification & acceptance criteria

6-gate integration checklist for v1.0.0 submission tag (per ADR-039)

This iteration is submission-ready when all six gates pass:

  1. Zero [OPEN] in SPEC_SHEET.md — every slot reads [LOCKED: ... (per ADR-NNN)] OR [TBD-at-Phase-N] with explicit rationale. Verified via grep -c "\[OPEN\]" SPEC_SHEET.md returns 0 (excluding the doc-header Status: [OPEN] line which transitions to [LOCKED] at Phase 0 close).
  2. Zero open rows in SPEC_GREENFIELD.md ledger appendix — every row reads locked-to-X (see ADR-NNN) OR superseded-by-NNN OR deferred-to-phase-N with explicit rationale. Verified via awk '/^\| open \|/' SPEC_GREENFIELD.md returns 0 lines.
  3. All tests/test_invariants.py stubs unskipped + green — every @pytest.mark.skip decorator removed; pytest -m unit exits clean. Verified via pytest -m unit tests/test_invariants.py + pytest --collect-only shows zero skipped tests.
  4. SUBMISSION_AUDIT.md regenerates cleanly — every claim in Accepted OR Superseded state (no Proposed at submission tag). Verified via make audit (wraps scripts/regenerate_audit.py --check) exits 0.
  5. v0.9.0-rc1 rehearsal tag fired successfully before v1.0.0 (per ADR-033) — verified via git tag -l v0.9.0-rc1* returns at least one tag + gh run list --workflow publish.yml shows green status for that tag.
  6. All three reviewer URLs at v1.0.0 resolve — source pin at tree/v1.0.0 + live Quarto site at GH Pages URL + GH release page with CHANGELOG + _site.tar.gz asset (per ADR-033). Verified via curl --head returns HTTP 200 (or 301-redirect-to-200) for all three URLs.

Per-ADR acceptance_criterion: frontmatter fields collectively cover the granular gates (data manifests + calibration artefacts + threshold reachability + HF Hub model card schema + etc.). Gate 4 is the mechanical check that all per-ADR criteria are satisfied.

Kit-default §6 gates (preserved; subsumed by gates 3 + 4 above but listed explicitly for kit-level continuity)

  • make test passes (incl. invariants for class balance, source-disjoint, frozen-dataclass, no-emoji, reporting-completeness).
  • make lint clean.
  • evals/results.json schema-validated against eval-toolkit’s results.v1.json.
  • All assumptions with severity ≥ medium in assumptions.md appear in the WRITEUP caveats block.

Submission-readiness sign-off

SUBMISSION_TEMPLATE.md (or SUBMISSION.md cover-letter) quotes the 6 gates so the submission-readiness check is reviewer-readable at submission tag. The submission is not ready until every gate above passes.


8. SDD process notes

  1. Spec freeze: once this document is LOCKED, changes require an ADR.
  2. Phase 0 interview: [LOCKED] agent reads spec, surfaces decisions, human picks, decisions become ADRs. .
  3. Process gates, not outcome gates: phase gates check that work was done and tests pass — not that metrics hit a target. deliberately avoids tying phase movement to outcome numbers so that the eval reports what was found rather than what was needed to advance.
  4. Transcript capture: [LOCKED] every session where decisions are discussed produces a transcript in transcripts/. .
  5. Prediction persistence: [LOCKED] per-row predictions are persisted alongside metrics. runpod-deploy pulls per-row score artifacts so downstream analyses (calibration, threshold sweeps, ROC curves) run from persisted predictions without re-running inference.
  6. ADR cadence: one ADR per significant decision; format per Michael Nygard.
  7. Assumption updates: when an assumption is invalidated mid-implementation, update assumptions.md and write a corrective ADR.
  8. Tests-as-invariants: every spec claim that can be made executable as a test, must be.

Linked ADRs: ADR-001, ADR-025, ADR-026, ADR-027, ADR-028, ADR-029, ADR-030, ADR-031, ADR-032, ADR-033, ADR-034, ADR-035, ADR-036, ADR-037, ADR-038, ADR-039, ADR-040, ADR-041, ADR-042, ADR-043.


9. Submission deliverables (Phase 0-07)

[LOCKED] Submission deliverables locked at Phase 0-07 — see ADR-030 (deliverable format = Quarto HTML site via GH Actions; supersedes ADR-002 PDF + repo) + ADR-031 (reviewer reading paths via index.qmd + sidebar nav; supersedes ADR-004 PDF-as-hub framing) + ADR-032 (HF Hub publication = headline rungs only with model card discipline) + ADR-033 (release strategy = v0.9.0-rc1 rehearsal + v1.0.0 submission + v1.0.x post-submission patches; CHANGELOG + _site.tar.gz release asset) + ADR-034 (reproducibility tier = full ladder T0 eval-from-hub + T1 smoke + T3 headline-cloud).

Reviewer email at submission carries three URLs + private attachment: 1. Source pin — https://github.com/brandon-behring/prompt-injection-detection-prototype/tree/v1.0.0 2. Live rendered Quarto site — https://brandon-behring.github.io/prompt-injection-detection-prototype/ 3. GH release page — https://github.com/brandon-behring/prompt-injection-detection-prototype/releases/tag/v1.0.0 4. Transcripts as private attachment per existing convention (gitignored).

Linked ADRs: ADR-030, ADR-031, ADR-032, ADR-033, ADR-034.


9. Open questions deferred to future iterations

Pointer: see NEXT_STEPS.md §3 for the consolidated list of open questions surfaced during Phase 0-5. Live as of v1.0.0:

  • Does the contamination-tier ordering hold under harder OOD slates?
  • Does the LoRA → full-FT gap survive higher seed counts?
  • Does the single-class-slice convention generalize beyond AUPRC / AUROC to calibration metrics?

Future-iteration tactical work in NEXT_STEPS.md §1; aspirational directions in NEXT_STEPS.md §2.


Appendix: decision trace

53 ADRs accepted across Phase 0-00 through v1.0.4 close (decisions/ADR-001-*.md through decisions/ADR-053-*.md; ADR-051 + ADR-052 + ADR-053 are post-v1.0.0 patch ADRs per CHANGELOG). SUBMISSION_AUDIT.md regenerates from the ADR frontmatter via scripts/regenerate_audit.py and is a CI hard gate. Transcripts for the multi-turn decision conversations live under transcripts/<YYYY-MM-DD>__<slug>.md (gitignored; emailed to the reviewer separately at submission).