1. Goal & non-goals
Goal: deliver a methodology-disciplined characterisation of what successive capability layers (classical TF-IDF + LR floor → frozen ModernBERT-base linear probe → LoRA adapters → full fine-tune) add to prompt-injection detection across a 4-source LODO IID test slate + a 5-slice OOD slate (BIPIA + InjecAgent + JBB-Behaviors + XSTest + NotInject), with bootstrap CIs + paired-bootstrap rung-vs-rung + calibration battery + dual-policy threshold characterisation per ADRs 005-046, published as a Quarto static site + HF Hub model cards + v1.0.0 GitHub release. No rung promoted as a winner; honest unflattering results retained.
Non-goals:
- Not optimizing for SOTA PR-AUC.
- Not building a deployable service. Deployment is not on the roadmap.
- Not creating a publishable benchmark.
- [LOCKED: per ADR-005 + ADR-017] Not picking a leader rung: each rung’s trade-offs are characterized, and no rung is promoted as the deployment recommendation. The rung-ladder IS the Pareto frontier (per ADR-005 methodology-over-metrics + ADR-017 trained-rung-slate-as-Pareto-instrument framing).
- [LOCKED-via-omission] No additional non-goals surfaced during Phase 0-00 through Phase 0-08; the three above + the rung-recommendation non-goal cover the project scope.
Scope authority: the spec itself is the scope cap. Anything not specified here is out of scope. Adding scope post-spec-freeze requires an ADR with explicit “Why this is in scope now” justification.
2. Phases & process gates
The project work is structured into six phases. Each phase has a gate checklist of work-completed and tests-passing items, not metric thresholds. The intent is to make movement between phases auditable, not to bind the project’s narrative to specific numerical outcomes.
Phase 0: Spec lock-in interview [LOCKED]
Gate: decisions/ directory contains an ADR for every Phase-0-resolved decision; transcripts/ directory contains the interview transcript.
Phase 1: Data
Gate: every checkbox ticked; tests/test_data.py and tests/test_leakage.py green.
Phase 2: Training
Gate: every checkbox ticked; training manifests schema-validated.
Phase 3: Evaluation
Gate: every checkbox ticked; evals/results.json parses cleanly.
Phase 4: Analysis
Gate: every checkbox ticked; analysis JSON outputs match schemas.
Phase 5: Writeup
Gate: every checkbox ticked; reviewer URLs (source pin at tree/v1.0.0 + live Quarto site + GH release page) all resolve; transcripts ready for private email attachment.
3. Data design
3.1 Train pool composition
[LOCKED: Path α — full source slate (per ADR-016)] — 4 positive sources + 2 benign sources + 5 OOD slices. HarmBench + Tensor Trust + LLMail-Inject deferred to afterword.
| Source | Size | Role | License | LODO fold |
|---|---|---|---|---|
| deepset/prompt-injections | ~500-650 (use all) | Train pos | Apache-2.0 | 1 |
| Lakera/gandalf_ignore_instructions | ~800-1000 (use all) | Train pos | MIT | 2 |
| Lakera/mosscap_prompt_injection | 3000 (cap) | Train pos | MIT | 3 |
| hackaprompt/hackaprompt-dataset | 3000 (cap) | Train pos | per dataset card | 4 |
| lmsys/lmsys-chat-1m | 10000 (cap; English-only filter) | Train neg | CC-BY-4.0 | (stratified across folds) |
| HuggingFaceH4/ultrachat_200k | 10000 (cap) | Train neg | Apache-2.0 | (stratified across folds) |
| leolee99/NotInject | 339 | OOD hard-neg (over-defense) | MIT | (never trained) |
| paul-rottger/xstest | 450 | OOD hard-neg (over-refusal) | per repo | (never trained) |
| JailbreakBench/JBB-Behaviors | 200 (100 harmful + 100 benign) | OOD mixed | MIT | (never trained) |
| microsoft/BIPIA | per-task | OOD indirect (zero-shot per ADR-014) | per repo | (never trained) |
| uiuc-kang-lab/InjecAgent | 1054 | OOD agentic (stretch probe) | per repo | (never trained) |
Subsample ceilings per source: [LOCKED: 3K positives per source for mosscap+HackAPrompt; use-all for deepset+Lakera-gandalf post-dedup; 10K benigns per source for LMSYS+UltraChat; random subsample at seed=42 (per ADR-016)]. Class balance per LODO training pool ≈ 1:2 to 1:2.7 (positives:benigns). Quality-filtered HackAPrompt + attack-type-stratified + length-stratified subsamples deferred to afterword.
3.2 Splits
[LOCKED: LODO k=4 over positive sources + 3 seeds per LODO fold; no internal k-fold (per ADR-016)]. Source-disjoint Leave-One-Dataset-Out at outer level (4 folds, one held-out positive source per fold) + 3 random-initialization seeds = 12 observations per rung. With the 4-rung trained slate locked by ADR-017 + ADR-019 (TF-IDF+LR + ModernBERT × {frozen-probe, LoRA, full-FT}), this is 48 trained runs total (4 rungs × 3 seeds × 4 LODO folds); 12 are sklearn CPU runs (TF-IDF+LR), 36 are H100/equivalent bf16 transformer runs with per-epoch prediction save per ADR-019 (72 transformer prediction files + 12 TF-IDF+LR prediction files = 84 trained-rung prediction parquets). Within each LODO fold: single 80/20 train/val random split (no nested k-fold); val used for threshold selection + calibration fitting + early-stopping per ADR-011 Guarantee 6 (NOT used for hyperparameter tuning per SPEC §2 hyperparameter-immutability). Per-rung bootstrap CIs from 12 observations (10K bootstrap iterations, BCa marginal per ADR-006); rung-vs-rung paired-bootstrap uses (LODO-fold × seed) pairing; MDE on Δ-AUROC ≈ 0.03. Stratified k-fold within LODO (Fomin 2025 / Nadeau-Bengio 2003 variance decomposition; ~5x compute) deferred to afterword.
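The (LODO-fold × seed) pairing described above can be sketched as follows. This is a minimal illustration, assuming per-cell metric values have already been computed for two rungs. It uses a plain percentile interval rather than the BCa interval named in ADR-006, and `paired_bootstrap_delta` is a hypothetical name, not the project's or eval-toolkit's API.

```python
import random
from itertools import product

N_FOLDS, N_SEEDS = 4, 3
# 4 LODO folds x 3 seeds = 12 paired observations per rung.
CELLS = list(product(range(N_FOLDS), range(N_SEEDS)))


def paired_bootstrap_delta(metric_a, metric_b, n_boot=10_000, alpha=0.05, rng_seed=1):
    """Percentile CI for the mean of (metric_a - metric_b) over paired cells.

    metric_a and metric_b are dicts keyed by (fold, seed_index). Each
    replicate resamples whole cells with replacement, so both rungs are
    always compared on the same (fold, seed) observation.
    """
    deltas = [metric_a[c] - metric_b[c] for c in CELLS]
    n = len(deltas)
    rng = random.Random(rng_seed)
    means = []
    for _ in range(n_boot):
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    point = sum(deltas) / n
    return point, lo, hi
```

Resampling cells rather than individual rows keeps the fold and seed structure intact, which is what makes the rung-vs-rung comparison paired.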
3.3 Dedup, leakage prevention, cross-source label conflicts
- Semantic dedup:
[LOCKED: sentence-transformers/all-MiniLM-L6-v2 cosine at threshold 0.80; simplified calibration via FPR+FNR on 50-pair labeled holdout persisted to evals/dedup_calibration.json (per ADR-016)]. Label-aware (within (source, label) cells); deterministic first-occurrence retention; cross-label minimal pairs preserved per SPEC_GREENFIELD lock. MPNet-base-v2 + full 4-gate selection rule + cross-encoder reranker deferred to afterword.
- Cross-source minimal pairs:
[LOCKED] preserve-and-flag.
- Cross-source benign dedup ordering:
[LOCKED: within-source-first → cross-source (LMSYS-priority tiebreak) → LODO split (per ADR-016)]. Pipeline: within-source dedup pass per source → cross-source dedup pass (LMSYS-priority on cross-source near-duplicates because LMSYS is real-user data; UltraChat is synthetic) → split into LODO folds with benign stratification.
- Leakage invariants:
tests/test_leakage.py asserts no exact-hash and no high-cosine train-test overlap.
- Reference-scorer training-overlap audit:
[LOCKED] see WRITEUP §3.3 + EVIDENCE.md §1–2.
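The label-aware, first-occurrence retention rule in the dedup bullets above can be sketched as below. This is a stdlib-only illustration that takes precomputed embeddings as plain lists. The row layout, the function names, and the choice of `>=` at the threshold are assumptions; the project itself delegates this step to eval-toolkit.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def dedup_first_occurrence(rows, threshold=0.80):
    """Return indices of rows kept after near-duplicate removal.

    rows: list of dicts with 'source', 'label', and 'embedding' keys
    (hypothetical layout). A row is dropped only if an earlier kept row
    in the same (source, label) cell has cosine >= threshold, so
    cross-label minimal pairs are never collapsed.
    """
    kept = []
    kept_by_cell = {}
    for i, row in enumerate(rows):
        cell = (row["source"], row["label"])
        prior = kept_by_cell.setdefault(cell, [])
        if any(cosine(row["embedding"], rows[j]["embedding"]) >= threshold for j in prior):
            continue  # near-duplicate of an earlier kept row in this cell
        prior.append(i)
        kept.append(i)
    return kept
```

Because rows are visited in input order and only compared against already-kept rows, the result is deterministic for a fixed row ordering.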
Truncation policy for inputs > length cap: [LOCKED: adaptive-chunked-max-pool stride=cap//2 at eval time; head-truncation at training time (per ADR-014)]. Training-positives are short so head-truncation rarely bites at train time (HF tokenizer default truncation_side="right"). At eval time, inputs exceeding the cap are split into overlapping chunks of size cap with stride cap // 2 (50 percent overlap so no token sits at a chunk boundary in both chunks); each chunk is scored independently; per-sample score is the max over chunk scores (max-pool aggregation — matches adversarial threat model). Under ADR-015 single-backbone refinement (ModernBERT-base at 8K native), adaptive chunked rarely activates (only on samples exceeding 8K tokens — about 5 percent of BIPIA per dossier estimate). Reference rungs run at their published native configurations including their native truncation policies (ProtectAI head-truncation at 512; Lakera as-API; LLM-judges receive full sample). Mandatory chunked-vs-head ablation on the BIPIA slice lives in WRITEUP/truncation-ablation.md. Phase 1 validation checkpoint: if BIPIA outlier-rate above 8K exceeds 15 percent of the slice, a superseding ADR-016 adjusts chunk-stride or aggregation policy.
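The eval-time chunking rule above (chunks of size cap, stride cap // 2, max over chunk scores) can be sketched as follows. `score_fn` stands in for whatever model call scores a token window; it and the function names are hypothetical.

```python
def chunk_spans(n_tokens, cap):
    """Return (start, end) spans of length <= cap with stride cap // 2."""
    if cap < 2:
        raise ValueError("cap must be >= 2 so the stride is positive")
    if n_tokens <= cap:
        return [(0, n_tokens)]  # short inputs are scored in one pass
    stride = cap // 2
    spans = []
    start = 0
    while True:
        end = min(start + cap, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break
        start += stride
    return spans


def score_chunked_max(token_ids, cap, score_fn):
    """Score every chunk independently and keep the maximum (max-pool)."""
    spans = chunk_spans(len(token_ids), cap)
    return max(score_fn(token_ids[s:e]) for s, e in spans)
```

Max-pooling means a single high-scoring chunk flags the whole sample, which is the conservative choice when an injected instruction may sit anywhere in a long input.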
3.4 OOD slate
[LOCKED: 5 OOD slices (per ADR-016) reported in two aggregation views (per ADR-021)] — direct over-defense + over-refusal + mixed-direct + indirect zero-shot + agentic-stretch. HarmBench + Tensor Trust + LLMail-Inject deferred to afterword as named next-iteration extensions.
Aggregation layout (per ADR-021): PDF executive headline table carries a single pooled-OOD column per rung (concatenated rows across the 5 slices, single AUPRC + AUROC + recall@FPR + ECE + Brier per rung). Methodology spoke at WRITEUP/ood-analysis.md (new file) carries the 5-by-rung per-slice grid with per-slice bootstrap CIs computed on the same persisted predictions via paired-bootstrap apparatus per ADR-006 + ADR-022, so the per-slice view needs no extra compute beyond additional metric calls. Pooled-and-per-slice reporting applies ADR-004 hub-and-spoke framing to OOD: pooled for A1 (hiring manager exec scan); per-slice for A2 (ML researcher generalization-question-by-question read). Aligns with Demšar 2006 JMLR multi-dataset reporting guidance.
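The two aggregation views can be illustrated with a small sketch that computes AUROC once on the concatenated rows and once per slice. It is a stdlib-only illustration with a pairwise (Mann-Whitney) AUROC; the row layout and function names are assumptions, and the project's own metric calls come from its evaluation toolkit.

```python
def auroc(labels, scores):
    """Pairwise AUROC; returns None when a class is missing."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # undefined on single-class data
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ood_views(rows):
    """rows: list of (slice_name, label, score) tuples (hypothetical layout).

    Returns (pooled_auroc, {slice_name: auroc}) from the same predictions.
    """
    by_slice = {}
    for name, y, s in rows:
        ys, ss = by_slice.setdefault(name, ([], []))
        ys.append(y)
        ss.append(s)
    per_slice = {name: auroc(ys, ss) for name, (ys, ss) in by_slice.items()}
    pooled = auroc([y for _, y, _ in rows], [s for _, _, s in rows])
    return pooled, per_slice
```

In this sketch a slice containing only one class yields `None` for AUROC, while its rows still contribute to the pooled value.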
| Slice | Source | Type | Rationale |
|---|---|---|---|
| NotInject | leolee99/NotInject | Hard-negative (benign-with-injection-triggers) | Tests over-defense per InjecGuard 2024 methodology; explicitly invites worse-but-honest evaluation per ADR-005 Principle 2 |
| XSTest | paul-rottger/xstest | Hard-negative (over-refusal) | Tests exaggerated-safety patterns per Röttger 2024 NAACL |
| JBB-Behaviors | JailbreakBench/JBB-Behaviors | Mixed (100 harmful + 100 benign) | Standardized misuse-behavior evaluation per Chao 2024 NeurIPS D&B |
| BIPIA | microsoft/BIPIA | Indirect (zero-shot OOD per ADR-014 Q1) | Indirect-injection benchmark per Yi 2023 KDD; the load-bearing zero-shot transfer measurement |
| InjecAgent | uiuc-kang-lab/InjecAgent | Agentic (stretch probe) | Tool-integrated agent injection per Zhan 2024 ACL; agentic transfer-of-transfer caveat per ADR-010 Bound 2 |
Linked ADRs: ADR-014 (threat-model bundle — attack-class scope), ADR-015 (rung architecture — 3 ModernBERT-base trained + 4 reference rungs), ADR-016 (this — data design bundle), ADR-008 (data scope brief-level locks — preserved), ADR-041 (Phase 1 implementation bundle — manifest rich-schema + live-fetch SHA pinning + manifest_validation.py placement + loader dispatch + stratified-cosine-band dedup holdout + slate-plus-templates contamination corpus + per-fold parquet materialization).
3.5 Phase 1 implementation status
[Phase 1 closed per ADR-041] Operationalisation of §3.1–3.4 locks; all 6 commits green. Per-commit status:
| Commit | Deliverables | Invariant tests | Status |
|---|---|---|---|
| Commit 1 | configs/data/source_manifest.yaml (live-fetched SHAs; rich schema; bump_history=[]; relocated from data/ per ADR-044 Q2) + src/data/manifest_validation.py + scripts/pin_source_manifest.py | test_source_manifest_schema_valid | green |
| Commit 2 | src/data/loaders.py (HF dispatch + 11 normalizers) + tests/smoke/test_loaders_smoke.py (3 small HF sources) | smoke tests | green (3 smoke + dispatch unit) |
| Commit 3 | src/data/dedup.py + scripts/build_dedup_holdout.py + scripts/calibrate_dedup.py + 4 smoke tests; preliminary evals/dedup_calibration.json via ADR-042 LLM-pre-label bootstrap (gpt-4o-2024-08-06; full 4-source coverage; FPR=0.00 FNR=0.33 at locked 0.80; FPR jumps to 0.063 at 0.75, so the 0.80 lock sits at the precision-recall knee) | test_dedup_calibration_persisted green | green; human_verified_pct=0 pending Brandon’s hand-examination per ADR-042 |
| Commit 4 | src/data/splits.py (LODO k=4 x 3 seeds x stratified 80/20) + materialize_splits + materialize_index_masks + 9 smoke tests | test_class_balance_per_fold + test_source_disjoint_train_test (unskip in Commit 5 with real data) | green |
| Commit 5 | src/data/audit.py + src/data/templates.py + scripts/extract_hackaprompt_templates.py + scripts/run_data_pipeline.py end-to-end orchestrator + ADR-043 post-split leakage cleanup; evals/{data_audit,leakage_report,contamination_scan}.json materialized (4707 deduped positives + 17246 deduped benigns + 1101 OOD; 180 leaked train rows dropped via ADR-043; A-005 triggers 1+2 clean; leakage_clean=True) | test_benign_contamination_scan_clean + test_class_balance_per_fold + test_source_disjoint_train_test all green | green (5 invariants total) |
| Commit 6 | Makefile Phase 1 targets (data-pin-manifest, data-prepare umbrella, data-fetch/data-dedup/data-splits/data-audit ADR-041-Q7-compat aliases, data-templates, data-dedup-{holdout,prelabel,calibrate}) + docs/ROADMAP.md Phase 1 close note + SUBMISSION_AUDIT regen + transcript checkpoint + push | n/a | green |
3.5.1 Phase 1 library-first carryforward refactor (per ADR-047)
[Phase 1 carryforward refactor closed per ADR-047 at Commit 4 2026-05-16] Triggered by Phase 4 entry walkthrough Q6 user reaffirmation of the library-first invariant as project-wide; retroactive audit identified 4 hand-rolls in src/data/ where eval-toolkit ships fitting primitives. Two upstream contributions filed at audit close: issue #18 (wire 50-pair golden dedup-holdout into eval-toolkit CI fixtures); issue #19 (3-pattern cookbook docs). Each refactor commit deletes orphaned local helpers in-commit per the no-orphaned-code discipline (saved as memory 2026-05-16).
| Commit | Deliverables | Invariant tests | Status |
|---|---|---|---|
| Commit 1 (ADR-047 setup) | ADR-047 + SPEC_SHEET §3.5.1 + upstream issues #18 + #19 filed + decisions/upstream_issues.md ledger updated + SUBMISSION_AUDIT regen | n/a | green |
| Commit 2 (splits refactor) | src/data/splits.py::make_splits consumes eval_toolkit.splits.SourceDisjointKFoldSplitter; project glue maps upstream-shuffled fold order back to TRAIN_POSITIVE_SOURCES tuple order (deterministic fold_id-to-source mapping preserved across refactor); per-seed stratified 80/20 train/val + benigns-in-every-train-pool preserved | 9 splits smoke tests + 5 invariants (test_class_balance_per_fold + test_source_disjoint_train_test + …) all pass | green |
| Commit 3 (dedup refactor) | src/data/dedup.py::{dedup_within_source, drop_train_test_leakage, dedup_cross_source_benigns} consume eval_toolkit.text_dedup.{near_dedup, EmbeddingCosineStrategy(embedder=compute_embeddings), EmbeddingCosineStrategy.pairs_across}; _greedy_first_occurrence_mask deleted in-commit (no remaining callers); pairwise_cosines retained pending Commit 4 (still has callers in audit.py + build_dedup_holdout.py + test); project-owned embedder glue (get_encoder + compute_embeddings + encoder_revision_sha) preserved; compute_embeddings signature broadened from list[str] to Sequence[str] for upstream Callable[[Sequence[str]], ndarray] Protocol compat (non-breaking: all callers pass list) | 4 dedup smoke tests pass (including test_dedup_cross_source_lmsys_priority priority-source reason preservation); 123/123 smoke total + 10 invariants pass; mypy + ruff green | green |
| Commit 4 (audit refactor + close) | src/data/audit.py::compute_leakage_report consumes run_leakage_checks([CrossSplitLeakageCheck]) per fold (ExactDuplicateCheck + NearDuplicateCheck dropped per implementation note: they would always report zero findings post-dedup_within_source); compute_contamination_scan consumes EmbeddingCosineStrategy.pairs_across(query, reference, k=1) + project per-source aggregation glue; project-dict output schemas preserved for both. _per_row_max_cosine_to_ref (audit.py local helper) deleted in-commit. pairwise_cosines (dedup.py) deleted in-commit (now truly orphaned after audit.py + build_dedup_holdout.py refactors away from it). test_pairwise_cosines_symmetric (tested deleted primitive) deleted in-commit. scripts/build_dedup_holdout.py::_enumerate_within_source_pairs refactored to use EmbeddingCosineStrategy.pairs_within(texts, n-1) so the script’s pairwise_cosines import dependency is severed. Output schema for evals/leakage_report.json preserved (CrossSplitLeakageCheck count maps to existing cosine_ge_085_overlaps field), so no schema migration is needed | 6 audit+dedup smoke tests pass (test_compute_data_audit_yields_per_source_counts + test_compute_leakage_report_zero_overlaps_on_disjoint_splits + test_compute_contamination_scan_unrelated_benigns_clean + test_compute_embeddings_shape_and_norm + test_dedup_within_source_drops_near_duplicates + test_dedup_cross_source_lmsys_priority); 122/122 smoke total (was 123; -1 from deleted test_pairwise_cosines_symmetric) + 10 invariants pass; mypy + ruff green | green |
Phase 1 library-first carryforward refactor CLOSED at Commit 4. ADR-046 (Phase 4 implementation bundle per prior 7-question ratification) writing unblocked; Phase 4 Commit 1 begins after ADR-046 lands.
3.6 Phase 2 implementation status
[Phase 2 closed per ADR-044] Operationalisation of §4 locks; all 6 commits green. Per-commit status:
| Commit | Deliverables | Invariant tests | Status |
|---|---|---|---|
| Commit 1 | ADR-044 (Phase 2 implementation bundle; partial supersession of ADR-019 seed slate (42,1337,2025)→(42,43,44)) + manifest move data/→configs/data/ per Q2 + 10-file path-ref update | test_source_manifest_schema_valid (still green at new path) | green |
| Commit 2 | src/training/{batch_table, lora_config, training_args, weighted_trainer, load_modernbert, softmax_cast}.py per ADR-019 + ADR-020 + 18 smoke tests | test_flash_attn_fallback_present + test_effective_batch_constant_across_gpu_classes green | green (7 invariants total) |
| Commit 3 | src/training/{tfidf_lr, train_classical}.py per ADR-017 + configs/rungs/classical_floor.yaml + scripts/train_classical_floor.py + 5 smoke tests | test_classical_floor_rung_present green | green (8 invariants total) |
| Commit 4 | src/training/train_modernbert.py multi-rung HF Trainer dispatch (frozen_probe + lora + full_ft via classifier_type) + configs/rungs/{frozen_probe, lora, full_ft}.yaml (ModernBERT-base SHA pinned at 8949b909) + PerEpochPredictionsCallback per ADR-019 + 10 smoke tests | test_per_epoch_predictions_present (deferred to canonical run; needs GPU) | green (8 invariants total; per-epoch invariant deferred) |
| Commit 5 | configs/runpod/headline-{frozen_probe, lora, full_ft}.yaml (runpod-deploy schema_version 2; H100/H200/A100/L40S failover; cost caps $40/$60/$100) + scripts/train_rung.py per-rung sweep + scripts/cost_rollup.py aggregator + 8 smoke tests | n/a (cloud runs at canonical) | green (code lands; runs deferred to canonical) |
| Commit 6 | tests/fixtures/processed/fold-0/seed-42/*.parquet (100/24/24 rows; 12KB total; reproducible via scripts/generate_fixtures.py at seed=1337) + configs/profiles/classical_fixtures.yaml + tests/smoke/test_smoke_pipeline.py (3 tests; fixture-pipeline + idempotency) + Makefile Phase 2 targets (generate-fixtures, train-classical-floor, train-rung RUNG=<...>, cost-rollup, cost-rollup-check, headline-{frozen-probe,lora,full-ft}) + make smoke extended to fixture-pipeline pass per ADR-027 line 75 + docs/ROADMAP.md Phase 2 close note | n/a | green |
3.7 Phase 3 implementation status
[Phase 3 closed per ADR-045] Operationalisation of §5 locks; all 6 commits green. Per-commit status:
| Commit | Deliverables | Invariant tests | Status |
|---|---|---|---|
| Commit 1 | ADR-045 (Phase 3 implementation bundle; scoring-first contract + 6-commit cadence + tiered ref-scorers + classical-scaffold + full-pairwise persistence with headline-only WRITEUP + pydantic schema validation) + SPEC_SHEET §3.7 status table + SUBMISSION_AUDIT regen | n/a | green |
| Commit 2 | src/scoring/{protectai, llm_judge_base, openai_judge, anthropic_judge}.py per ADR-018 + src/eval/schemas.py (pydantic models: PredictionsRowModel, MetricsRecordModel, SliceMetricsModel, OperatingPointModel, CalibrationRecordModel, ReachabilityAuditModel, BootstrapCellModel) + versioned prompt template at src/scoring/prompts/prompt_template_v1.md + Tier-A (ProtectAI) CI smoke + Tier-B (LLM judges) cache infrastructure at evals/audit/llm_judge_cache/<judge>__<sha256-prefix>.json per A-007 + A-014 + 22 smoke tests | test_reference_scorer_schema_uniform green | green (9 invariants total) |
| Commit 3 | src/eval/calibration_battery.py per ADR-023 (eval-toolkit ECE 4-variant matrix expected_calibration_error{,_debiased,_l2,_l2_debiased} + expected_calibration_error_equal_mass headline at n_bins=15 + brier_score + brier_decomposition reliability/resolution/uncertainty + fit_temperature + fit_isotonic_calibrator + reliability_curve; validation-only fit per ADR-011 Guarantee 6; proba_to_logprobs + apply_temperature helpers for binary-to-2-col-logit conversion) + 12 smoke tests | test_calibration_battery_outputs_4ece_plus_brier green | green (10 invariants total) |
| Commit 4 | src/eval/operating_points.py per ADR-025 (TargetFPRSelector(0.01) detection + TargetRecallSelector(0.99) verification per-(rung, fold, seed) val fit; fit_operating_point + fit_dual_policy_for_cell + compute_reachability_audit per A-009) + src/eval/slice_analysis.py per ADR-021 (5-slice OOD slate compute_metric_record + pooled-headline compute_pooled_ood_record + per-slice spoke aggregate_slice_across_observations + 0.1% pinpoint volatility surfaces compute_pinpoint_volatility per ADR-021 line 53-65) + 20 smoke tests | module-level smoke tests cover contract (test_dual_policy_threshold_pairing + test_verification_reachability_audit + test_ood_aggregation_layout + test_recall_at_fpr_pinpoint_volatility are integration-level invariants deferred to Commit 5 when scripts wire end-to-end) | green (10 invariants total; 4 stubs deferred to Commit 5) |
| Commit 5 | scripts/run_metrics_battery.py (loads predictions parquets per rung × fold × seed × slice; emits MetricsRecordModel + pooled-OOD records via src/eval/slice_analysis.py) + scripts/fit_dual_policy_thresholds.py (sweeps trained-rung × fold × seed; reference scorers filtered via TRAINED_RUNGS allowlist per SPEC §4; emits OperatingPointModel + ReachabilityAuditModel nested-JSON per A-009) + scripts/run_bootstrap_battery.py (full-pairwise C(rungs, 2) × slices × metrics via eval_toolkit.bootstrap.paired_bootstrap_diff; persists BootstrapCellModel per Q6 user refinement so post-hoc questions answer from disk; WRITEUP features the 3 headline comparisons) + scripts/eval_from_hub.py T0-tier dry-run surface per ADR-034 (full body gated on Phase 5 ADR-032 publication) + 5 subprocess-based smoke tests covering all 4 entrypoints | smoke covers contract; integration invariants (test_dual_policy_threshold_pairing + test_verification_reachability_audit + test_ood_aggregation_layout + test_recall_at_fpr_pinpoint_volatility + test_bootstrap_n_and_stability_check + test_paired_across_rungs_pairing + test_cross_fold_ci_methodology) remain skip-marked pending Phase 4 canonical evals run on full 84-parquet trained-rung output | green (10 invariants total; 7 integration stubs deferred to Phase 4) |
| Commit 6 | Makefile Phase 3 targets (eval-classical-floor, eval-reference-scorers-free Tier-A scaffold, eval-reference-scorers-paid Tier-B with interactive approval per ADR-045 Q4, metrics-battery, dual-policy-thresholds, bootstrap-battery, eval-from-hub Phase-3-wired) + make smoke extension (now includes run_metrics_battery.py end-to-end pass on classical-floor fixture predictions + eval_from_hub.py --dry-run per ADR-027 sub-10-min budget) + tests/fixtures/metrics/ gitignored + docs/ROADMAP.md Phase 3 close note + Phase 4 unblock | n/a | green |
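The dual-policy thresholds named in the Commit 4 row (a target-FPR policy at 0.01 and a target-recall policy at 0.99, both fitted on validation data) can be sketched from scratch as below. This is not the eval-toolkit `TargetFPRSelector` or `TargetRecallSelector` implementation; the function names, the `score >= threshold` decision rule, and the tie-breaking are assumptions.

```python
def _rates(labels, scores, t):
    """Recall and FPR when predicting positive for score >= t."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("validation data must contain both classes")
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= t)
    return tp / n_pos, fp / n_neg


def threshold_at_target_fpr(labels, scores, target_fpr=0.01):
    """Lowest threshold whose validation FPR is <= target_fpr."""
    # FPR never increases as t grows, so the first hit is the lowest one.
    for t in sorted(set(scores)) + [float("inf")]:
        recall, fpr = _rates(labels, scores, t)
        if fpr <= target_fpr:
            return t, recall, fpr


def threshold_at_target_recall(labels, scores, target_recall=0.99):
    """Highest threshold whose validation recall is >= target_recall."""
    # Recall never decreases as t shrinks, so the first hit is the highest one.
    for t in sorted(set(scores), reverse=True):
        recall, fpr = _rates(labels, scores, t)
        if recall >= target_recall:
            return t, recall, fpr
```

Fitting both thresholds per (rung, fold, seed) cell on validation data, then applying them unchanged to test and OOD rows, keeps threshold selection out of the evaluation split.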
3.8 Phase 4 implementation status
[Phase 4 closed per ADR-046] Operationalisation of §5 plus ADR-006 + ADR-022 + ADR-024 + ADR-025 (plus partial supersession of ROADMAP TBD-at-Phase-4 reference-scorer-audit-deferred framing per ADR-046 Q5 user override → include-now-locked); all 6 commits green. Reference-scorer slate further narrowed by ADR-050 at Phase 4-5 transition (LLM judges dropped on cost; full-FT OOD dropped on FUSE crash). Per-commit status:
| Commit | Deliverables | Invariant tests | Status |
|---|---|---|---|
| Commit 1 | ADR-046 (Phase 4 implementation bundle; 6-commit cadence + scaffold-with-classical + always-emit-both-CIs auto-flag + MDE-on-every-emitted-CI + LLM-rater audit included per user override + library-first hybrid figures per project-wide invariant codification + Phase 5 prep deferred) + SPEC_SHEET §3.8 status table + SUBMISSION_AUDIT regen | n/a | green |
| Commit 2 | src/eval/marginal_bootstrap.py per ADR-022 (bootstrap_ci wrappers; 10K @ seed=1 headline + 10K @ seed=2 stability check) + src/eval/cross_fold_ci.py (cv_clt_ci headline per ADR-024; block-bootstrap spoke fields scaffolded as None pending Commit 3) + src/eval/mde.py per ADR-006 (mde_from_paired_ci_record direct wrap + mde_from_marginal_ci_record closed-form workaround per upstream issue #20) + 3 new pydantic schemas (MarginalBootstrapCellModel + CrossFoldCIModel + MDECellModel) + 18 smoke tests | test_marginal_bootstrap_seed_stability + test_cv_clt_ci_headline_present (deferred-unskip at canonical evals run; 44 total tests collect cleanly) | green |
| Commit 3 | src/eval/cross_fold_ci.py extension: compute_block_bootstrap_on_folds (inline NumPy workaround per upstream issue #21; vectorized resample of K folds with replacement; percentile CI per ADR-022) + compute_a_008_flag (strict > 1.5 per A_008_RATIO_THRESHOLD; degenerate-cv_clt edge case handled) + compute_cross_fold_ci_cell always populates both cv_clt + block fields + the boolean flag; 10 new smoke tests cover halfwidth ordering + seed determinism + flag rule + threshold constant | test_block_bootstrap_folds_spoke_present + test_a_008_flag_fired_when_ratio_exceeds_1_5 (deferred-unskip at canonical evals run; 46 total tests collect) | green |
| Commit 4 | src/eval/figures.py per Q6: library-first hybrid 7-figure slate; consumes eval_toolkit.plotting.{plot_pr_curve, plot_reliability_diagram, plot_bootstrap_distribution, plot_lift_ci, save_figure, set_plot_style, PALETTE} for F3 + F4 + F6-right + F7-subpanels + project glue for F1 Pareto + F2 ROC + F5 heatmap + F6-left + F7 grid layout (cites upstream issues #14 + #15 + #16 + new #22 plot_metric_bars ax kwarg for F6-left as TODOs); SVG output via save_figure writes a {stem}.meta.json sidecar carrying provenance per ADR-030; 14 smoke tests pass headless via matplotlib.use("Agg"); matplotlib graduated to main deps from notebook extras | test_figures_slate_7_svgs_present + test_save_figure_provenance_chunks_present (deferred-unskip when Commit 5 orchestrates the canonical slate; 48 total tests collect) | green |
| Commit 5 | Orchestration scripts: scripts/run_marginal_bootstrap.py per Q4 (sweeps marginal cells x both seeds per ADR-022) + scripts/run_cv_clt_ci.py per Q3 (sweeps both cv_clt + block fields + a_008 flag) + scripts/run_mde.py per Q4 (aggregates MDE across paired + marginal + cv_clt + block cells via closed-form path; emits evals/audit/mde_per_cell.parquet) + scripts/render_figures.py per Q6 (canonical + scaffold paths; emits docs/plots/F{1..7}.svg + per-figure .meta.json provenance sidecars) + scripts/audit_reference_scorers.py per Q5 user override (samples disagreement pairs vs trained rung, interactive approval gate per ADR-020 + --dry-run cost preview + --assume-yes for CI; uses OpenAIJudge from Phase 3 Commit 2 with locked OPENAI_JUDGE_MODEL per ADR-018) + 5 subprocess-based smoke tests | smoke covers contract; canonical-data invariants deferred to operator-gated runs | green |
| Commit 6 | Makefile Phase 4 targets (marginal-bootstrap, cv-clt-ci, mde-battery, render-figures, audit-reference-scorers, phase4-all umbrella) + extended make smoke (now also runs scripts/render_figures.py --scaffold writing 7 SVG + sidecars to tests/fixtures/plots/; under ADR-027 sub-10-min budget) + tests/fixtures/plots/ gitignored + docs/ROADMAP.md Phase 4 status + close note + 6-step v0.9.0-rc1 rehearsal-tag dispatch checklist per ADR-033 + Phase 5 (Writeup) unblock | n/a | green |
After Commit 6 lands + invariants pass, v0.9.0-rc1 rehearsal tag fires triggering the full publish pipeline (Quarto site build per ADR-030 + GH Pages deploy + HF Hub model card pushes per ADR-032) as a 24+ hour dress-rehearsal per ADR-033 + ADR-038. Phase 5 (Writeup) begins after.
4. Model recipe (locked, no gridsearch)
Each rung is locked before training begins. No val-set hyperparameter gridsearch.
4.1 Rung 1 — classical floor (TF-IDF + LR)
[LOCKED: sklearn TF-IDF + LogisticRegression (per ADR-017)] Combined sparse features via FeatureUnion: word 1-2-grams (max_features=15000, sublinear_tf=True, lowercase=True, strip_accents='unicode') + char 3-5-grams (max_features=15000); concatenated → up to 30K-dim sparse matrix. Classifier: LogisticRegression(solver='liblinear', C=1.0, class_weight='balanced', max_iter=1000), fit to convergence; no epoch concept; deterministic per seed. Seed slate: originally 42, 1337, 2025; partially superseded to 42, 43, 44 by ADR-044 (see §3.6 Commit 1). 3 seeds × 4 LODO folds = 12 sklearn CPU runs. Contamination state: verified_disjoint (trained on our LODO splits by construction).
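The locked recipe above maps onto scikit-learn roughly as follows. This is a sketch under assumptions: the builder name is hypothetical, the char branch uses analyzer="char" (the spec does not say whether "char" or "char_wb" is used), and passing the run seed as random_state is an assumption about how per-seed determinism is wired.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline


def build_classical_floor(seed):
    """TF-IDF (word 1-2-grams + char 3-5-grams) feeding a liblinear LR."""
    features = FeatureUnion([
        ("word", TfidfVectorizer(
            analyzer="word", ngram_range=(1, 2), max_features=15000,
            sublinear_tf=True, lowercase=True, strip_accents="unicode")),
        # Assumption: plain "char" analyzer; the spec only fixes the n-gram
        # range and max_features for this branch.
        ("char", TfidfVectorizer(
            analyzer="char", ngram_range=(3, 5), max_features=15000)),
    ])
    clf = LogisticRegression(
        solver="liblinear", C=1.0, class_weight="balanced",
        max_iter=1000, random_state=seed)
    return Pipeline([("tfidf", features), ("lr", clf)])
```

Usage would look like `build_classical_floor(seed=42).fit(train_texts, train_labels)`, followed by `predict_proba` on the held-out fold.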
4.2 Rung 2 — frozen-features probe
[LOCKED: ModernBERT-base frozen-probe (per ADR-015 + ADR-019)] — Transformer body frozen; linear classifier head (2-class) trained on [CLS]-pooled embeddings via WeightedTrainer subclass (CrossEntropyLoss with per-fold sklearn class_weight='balanced' tensor; per ADR-019). bf16=True with fp32 cast before final softmax. 2 epochs; cosine LR schedule with 10% warmup; lr=1e-4. Per-epoch checkpoint + per-epoch parquet predictions persisted. Dual role per ADR-017: candidate detector in headline table AND diagnostic anchor in methodology spoke. Contamination state: backbone-partial-disjoint (fine-tuning disjoint by LODO; backbone pretrain corpus may overlap eval sources).
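The per-fold class_weight='balanced' tensor follows scikit-learn's documented heuristic, n_samples / (n_classes * count_per_class). A stdlib sketch of that formula is shown below; the function name is hypothetical, and converting the result into a tensor for the loss is left to the trainer.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weights per class: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}
```

For a fold with 100 positives and 250 benigns, this gives the positive class a weight of 1.75 and the benign class a weight of 0.7, so the minority class contributes proportionally more to the cross-entropy loss.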
4.3 Rung 3 — LoRA adapter-fine-tuned
[LOCKED: ModernBERT-base LoRA (per ADR-015 + ADR-019)] PEFT-LoRA adapters; backbone frozen; classifier head full-FT via modules_to_save=["classifier"]. Locked recipe (per ADR-019): LoraConfig(r=8, lora_alpha=16, lora_dropout=0.1, target_modules=["Wqkv", "attn.Wo", "mlp.Wo", "mlp.Wi"], task_type="SEQ_CLS", bias="none"), with explicit module enumeration (4 LoRA modules per encoder layer × 22 layers = 88 adapter modules) rather than "all-linear" auto-detection. TrainingArguments: lr=1e-4, warmup_ratio=0.10, lr_scheduler_type=cosine, per_device_train_batch_size=16 + gradient_accumulation_steps=2 (effective batch 32; ADR-020 BATCH_TABLE scales for non-H100 classes), num_train_epochs=2, bf16=True, max_grad_norm=1.0, weight_decay=0.01, save_strategy="epoch", eval_strategy="no". DataCollatorWithPadding(max_length=8192, pad_to_multiple_of=8): dynamic padding, head-truncation per ADR-014 Q4 training-time. Per-fold sklearn class_weight='balanced' via WeightedTrainer. Contamination state: backbone-partial-disjoint.
4.4 Rung 4 — full-FT trained backbone
[LOCKED: ModernBERT-base full-FT (per ADR-015 + ADR-019)] — Full backbone parameters trainable; standard HF Trainer + eval-toolkit metric callbacks + WeightedTrainer subclass for class-weighted CE. Same recipe as Rung 3 (lr=1e-4, 2 epochs, bf16, effective batch 32, etc.). Intermediate (epoch-1) weight checkpoints not persisted to disk (~1.8 GB throwaway across 12 runs); per-row predictions for epoch-1 are saved without the underlying weights since predictions are the audit-relevant artifact. Final epoch checkpoint is persisted per ADR-013 pre-teardown checklist. Contamination state: backbone-partial-disjoint.
4.5 Reference rungs — 2 published baselines at native config (post-ADR-050 narrowing)
[LOCKED: 2 reference rungs (per ADR-018 → ADR-050 narrow supersession; LLM judges dropped on Phase 4 cost re-estimation, ~16× envelope overrun)] Lakera Guard dropped at Phase 0-03 per ADR-018 (afterword extension). LLM judges (gpt-4o-2024-08-06 + claude-sonnet-4-6) dropped at Phase 4 per ADR-050. The vendor_black_box contamination tier therefore carries 0 rungs in this submission; the contamination-stratification gradient compresses from 4 tiers to 3 (verified_disjoint + backbone-partial-disjoint + suspected_contamination). ProtectAI v1 + v2 remain as the 2 reference rungs; with the trained TF-IDF+LR classical floor as the verified_disjoint anchor, they make up the 3-rung comparison slate.
- R-ProtectAI-v1:
protectai/deberta-v3-base-prompt-injection (HF revision SHA-pinned at Phase 1 per ADR-016 manifest); inference-only at native config (head-truncation at 512); bf16 on GPU. Contamination state: suspected_contamination.
- R-ProtectAI-v2:
protectai/deberta-v3-base-prompt-injection-v2 (HF revision SHA-pinned at Phase 1); inference-only at native config (head-truncation at 512); bf16 on GPU. Contamination state: suspected_contamination.
Each reference rung is called at its published native configuration including its native truncation policy. Apples-to-apples comparison against deployed baselines requires testing them as they exist, not as preprocessed by us. Training-data overlap audit per EVIDENCE.md §1-2. The methodology spoke includes a dedicated Contamination stratification subsection narrating the three-tier disclosure gradient (verified_disjoint → backbone-partial-disjoint → suspected_contamination); the trained-rung-vs-reference comparison is framed as “what trained-from-scratch (TF-IDF+LR verified_disjoint anchor) achieves versus what potentially-memorized off-the-shelf models achieve.”
LODO comparison: 3-rung trained ladder (frozen-probe + LoRA + full-FT) retained per ADR-050 Revision 2 (full-FT LODO predictions survived Phase 2). OOD comparison: 2-rung trained (frozen-probe + LoRA) + classical floor (tfidf-lr) + 2 reference scorers (ProtectAI v1 + v2) = 5-rung OOD slate. full-FT OOD inference dropped at X11 FUSE EIO crash per ADR-050.
4.6 Per-epoch prediction-save discipline
[LOCKED: epoch-2 headline, epoch-1 diagnostic (per ADR-019)] — Per-row predictions persisted for every transformer (rung, seed, fold, epoch) combination → 72 transformer prediction parquets + 12 TF-IDF+LR (no-epoch) + 16 reference rungs = 100 total prediction files. File-path convention: evals/predictions/<rung>__fold<F>__seed<S>__epoch<N>.parquet. Discipline rule pre-committed: epoch-2 predictions are the publication number; epoch-1 predictions are reported as a diagnostic ablation in the methodology spoke (the per-(rung, seed, fold) epoch-1→epoch-2 AUPRC delta plot surfaces undertraining-vs-overfitting boundaries).
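The path convention and the transformer file count can be checked with a short sketch. The rung identifiers and the 0-based fold numbering below are assumptions made for illustration; only the path template, the three seeds, and the two epochs come from this spec.

```python
from itertools import product

# Hypothetical rung identifiers; the spec fixes the template, not these names.
TRANSFORMER_RUNGS = ["frozen-probe", "lora", "full-ft"]
FOLDS = range(4)          # 4 LODO folds (0-based numbering is an assumption)
SEEDS = [42, 1337, 2025]  # seeds locked in the multi-seed protocol
EPOCHS = [1, 2]           # epoch-1 diagnostic, epoch-2 headline


def prediction_path(rung: str, fold: int, seed: int, epoch: int) -> str:
    """Render evals/predictions/<rung>__fold<F>__seed<S>__epoch<N>.parquet."""
    return f"evals/predictions/{rung}__fold{fold}__seed{seed}__epoch{epoch}.parquet"


transformer_paths = [
    prediction_path(rung, fold, seed, epoch)
    for rung, fold, seed, epoch in product(TRANSFORMER_RUNGS, FOLDS, SEEDS, EPOCHS)
]

# Only epoch-2 files feed the publication numbers.
headline_paths = [p for p in transformer_paths if p.endswith("__epoch2.parquet")]

print(len(transformer_paths))  # 72 = 3 rungs x 4 folds x 3 seeds x 2 epochs
print(len(headline_paths))     # 36
```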
4.7 Matched-budget controls
[LOCKED: per-axis (per ADR-018)] — Match data (same train/eval splits per ADR-016) + eval methodology (same metrics, same statistical machinery per ADR-006); do NOT match training compute. Each rung uses its natural recipe; training compute is reported alongside the metric so AUPRC-vs-compute can be plotted as a Pareto frontier — the rung-ladder IS the Pareto frontier. Per-axis matching is the only framing that coherently handles the heterogeneous cost classes (LLM-judge $/call, trained rungs GPU-minutes, ProtectAI inference-only). Documented as a dedicated Matched-budget framing subsection in the methodology spoke.
4.8 Compute infrastructure (per ADR-020)
[LOCKED: runpod-deploy 0.7.7 with 8-class GPU failover, dual-DC, adaptive batch, dual-layer cost cap]
- pod.gpu_order (priority): H100 80GB HBM3 → H100 NVL → H100 SXM → H100 PCIe → H200 → H200 NVL → A100-SXM4-80GB → A100 80GB PCIe → L40S → A100-SXM4-40GB (emergency)
- pod.datacenters: [US-MD-1, EU-RO-1] (dual-DC failover)
- BATCH_TABLE (preserves effective batch = 32 across GPU classes): H100/H200/A100-80G use (per_device=16, grad_accum=2); A100-40G/L40S use (8, 4); L40 uses (4, 8). Pre-locked lookup keyed on torch.cuda.get_device_name; fail-loud on unlisted GPU.
- flash_attention_2 fallback per runpod-deploy recipe: try/except (ValueError, ImportError) around model load → degrades to stock SDPA on smaller classes; events.emit_event("flash_attn_fallback", ...) for audit.
- Cost cap (dual-layer): per-job budget.cost_cap_usd=125.0 (orchestrator-enforced; = A-002 upper-bound soft cap) + project-wide hard cap $200 enforced by scripts/cost_rollup.py CI-gated check aggregating across all per-pod runpod_deploy_pull_manifest.json files + API call logs.
- assumed_hourly_rate_usd=3.50 (H100 spot midpoint; reconciled post-first-run per cost-reconciliation recipe).
- Preflight discipline: runpod-deploy validate --all + runpod-deploy run --dry-run before any billed run.
- Cost tracking (dual-layer): per-pod automatic via runpod_deploy_pull_manifest.json + per-Makefile-target rollup in evals/cost_ledger.csv (cols: timestamp, target, est_cost_usd, actual_cost_usd, gpu_hours, api_calls, notes).
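A minimal sketch of the BATCH_TABLE lookup, assuming the device name is matched by substring. The matching rule, the BatchConfig type, and the function name are illustrative assumptions rather than the repo's actual code; only the (per_device, grad_accum) pairs and the fail-loud requirement come from the list above.

```python
from typing import NamedTuple


class BatchConfig(NamedTuple):
    per_device: int
    grad_accum: int


# Ordered rules: every substring in the tuple must appear in the device name.
# "L40S" is listed before "L40" because the first match wins.
BATCH_TABLE = [
    (("H100",), BatchConfig(16, 2)),
    (("H200",), BatchConfig(16, 2)),
    (("A100", "80GB"), BatchConfig(16, 2)),
    (("A100", "40GB"), BatchConfig(8, 4)),
    (("L40S",), BatchConfig(8, 4)),
    (("L40",), BatchConfig(4, 8)),
]

EFFECTIVE_BATCH = 32


def batch_config_for(device_name: str) -> BatchConfig:
    """Return the locked batch settings for a GPU, or fail loudly if unlisted."""
    for needles, config in BATCH_TABLE:
        if all(needle in device_name for needle in needles):
            # Guard the invariant the table exists to preserve.
            assert config.per_device * config.grad_accum == EFFECTIVE_BATCH
            return config
    raise RuntimeError(
        f"GPU {device_name!r} is not in BATCH_TABLE; refusing to guess a batch size"
    )


# On a pod the name would come from torch.cuda.get_device_name(0).
print(batch_config_for("NVIDIA H100 80GB HBM3"))
```

Passing the name in as a string keeps the lookup testable on a laptop without a GPU or a torch install.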
4.9 Future-work extensions (afterword)
[LOCKED: NONE in primary slate; future-work extensions named per ADR-015 + ADR-017 + ADR-018 + ADR-019 alternatives] — ModernBERT-large size-up, matched-context cross-backbone control, alternate classification head (MLP), calibration via validation-fit temperature, Lakera Guard re-addition (ToS-permitting), frontier-tier judge ablation (gpt-4.1 / opus-4-7), reasoning-judge ablation (o1/o3), multi-judge ensemble, rank ablation (r=4/r=16/r=32), target-module ablation (Q+V vs all-linear), DoRA / rs-LoRA / VeRA, 1-epoch-locked schedule comparison, 3-epoch convergence study, focal loss vs class-weighting, per-source learning-curve decomposition, hashing vectorizer for long docs, calibrated LR via CalibratedClassifierCV. Calibration is a separate methodology axis (Phase 0-04 walks the calibration battery, ledger row 343).
Linked ADRs: ADR-015 + ADR-017 + ADR-018 + ADR-019 + ADR-020 (compute + cost discipline).
5. Eval design
5.1 Primary descriptive metrics
[LOCKED: PR-AUC + ROC-AUC + recall@FPR={0.1pct-pooled-only, 1pct, 5pct} + ECE-equal-mass(n_bins=15, quantile) + Brier on raw scores per rung (per ADR-021 + ADR-023)]. All reported with bootstrap CIs per ADR-022 + ADR-024 (cv_clt_ci on 12 (fold, seed) per-rung values for rank-based metrics; pool-rows-and-compute-once for per-row metrics; 10K @ seed=1 + 10K @ seed=2 stability check; >5% half-width flag).
Dual-policy operating-point columns (per ADR-025) — trained rungs only — gain one new headline column “FPR @ recall ≥ 99%” (verification policy operating point via TargetRecallSelector(0.99) on val); the existing R@FPR=1% column carries a footnote tagging it as the detection policy operating point via TargetFPRSelector(0.01) on val. Headline footprint per trained rung settles at: AUPRC | AUROC | R@FPR=0.1%* | R@FPR=1%† | R@FPR=5% | FPR@R≥99%† | ECE | Brier (* = ADR-021 0.1%-pooled-only volatility flag; † = dual-policy operating points). Reference rungs receive blank cells in the verification column with footnote pointing to the SPEC §4 dual-policy applicability lock (only trained rungs get dual-policy framing; reference scorers report recall@FPR pinpoints only with contamination caveats per ADR-018).
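The two operating points can be illustrated with a small threshold search on validation scores. This is a sketch of the idea only: the function names are hypothetical, and eval-toolkit's TargetFPRSelector and TargetRecallSelector may handle ties or interpolation differently. The toy FPR cap of 0.2 is used because the toy set has only five negatives.

```python
def rates_at(threshold, scores, labels):
    """Return (recall, fpr) when predicting positive for score >= threshold."""
    tp = fp = pos = neg = 0
    for score, label in zip(scores, labels):
        predicted_positive = int(score >= threshold)
        if label == 1:
            pos += 1
            tp += predicted_positive
        else:
            neg += 1
            fp += predicted_positive
    return tp / pos, fp / neg


def select_detection_threshold(scores, labels, max_fpr=0.01):
    """Lowest threshold whose validation FPR is <= max_fpr (maximises recall under the cap)."""
    for threshold in sorted(set(scores)) + [float("inf")]:
        _, fpr = rates_at(threshold, scores, labels)
        if fpr <= max_fpr:
            return threshold


def select_verification_threshold(scores, labels, min_recall=0.99):
    """Highest threshold whose validation recall is >= min_recall (minimises FPR under the floor)."""
    for threshold in sorted(set(scores), reverse=True):
        recall, _ = rates_at(threshold, scores, labels)
        if recall >= min_recall:
            return threshold


# Toy validation split (illustrative values, not project data).
val_scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.80, 0.90, 0.95]
val_labels = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]

detection_t = select_detection_threshold(val_scores, val_labels, max_fpr=0.2)
verification_t = select_verification_threshold(val_scores, val_labels, min_recall=0.99)

print(detection_t, rates_at(detection_t, val_scores, val_labels))        # 0.6 (0.8, 0.2)
print(verification_t, rates_at(verification_t, val_scores, val_labels))  # 0.3 (1.0, 0.4)
```

The detection policy gives up recall to hold FPR down, while the verification policy accepts a higher FPR to keep recall at the floor; the headline table reports one number from each side of that trade.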
Recall@FPR pinpoint volatility surfacing (per ADR-021) — for the 0.1% pinpoint at pooled level: half-width column alongside point estimate; flag marker when half-width > 0.5 × point estimate; resample-degeneracy fraction emitted to evals/audit/per_rung_audit.json; per-resample threshold-drift dump to evals/audit/pinpoint_threshold_drift.json; methodology spoke explains why 0.1% reports wider CIs and is not computable per-slice. The 0.1% pinpoint is reported only at the pooled aggregation level (pooled n_neg ≈ 16-20K yielding 16-20 FPs at threshold); at per-slice or per-LODO-fold aggregation it is reported as “not computable at this aggregation level (n_neg too small)”.
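The arithmetic behind the pooled-only rule can be sketched as follows. The helper names are hypothetical; the flag rule (half-width > 0.5 × point estimate) and the negative counts are the ones quoted in this spec.

```python
def false_positive_budget(n_neg: int, target_fpr: float) -> float:
    """Expected number of false positives that define the threshold at a given FPR."""
    return n_neg * target_fpr


def volatility_flag(point_estimate: float, half_width: float) -> bool:
    """Flag rule from the spec: CI half-width exceeds half of the point estimate."""
    return half_width > 0.5 * point_estimate


# Pooled level (16-20K negatives) versus a single LODO fold (about 4-5K benigns).
for n_neg in (20_000, 16_000, 4_500):
    print(n_neg, round(false_positive_budget(n_neg, 0.001), 2))
```

At the pooled level the threshold is pinned by roughly 16 to 20 negatives; at a single fold it is pinned by four or five, so one resampled row can move the pinpoint substantially.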
Calibration battery composition (per ADR-023). Headline: ECE-equal-mass(n_bins=15, quantile binning) + Brier on raw scores per rung. Spoke (WRITEUP/calibration.md): all 4 ECE variants from eval-toolkit (L1/L2 × plug-in/debiased) + Brier decomposition (refinement / reliability / uncertainty) + reliability diagrams (equal-mass quantile) + intervention deltas: temperature scaling (Guo 2017 1-parameter) + isotonic regression (non-parametric monotonic remapping); both calibrators fit on validation only per-(rung, fold, seed) per ADR-011 Guarantee 6. Temperature scaling is strictly monotonic and therefore does NOT change rank-based headline metrics (PR-AUC, ROC-AUC, recall@FPR); isotonic regression is monotonic but not strictly so, and the ties it introduces can shift rank-based metrics slightly, which is one reason the headline is computed on raw scores.
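For orientation, the two headline calibration numbers can be sketched in plain Python. This is a plug-in L1 ECE with equal-mass bins on the positive-class score, written under the assumption that ties at bin boundaries are split arbitrarily; it is not the eval-toolkit implementation, and the function names are illustrative.

```python
def brier_score(scores, labels):
    """Mean squared error between predicted probability and the 0/1 label."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(scores)


def ece_equal_mass(scores, labels, n_bins=15):
    """Plug-in L1 ECE using quantile bins that each hold about n / n_bins rows."""
    # Stable sort on the score only, so tied scores keep their input order.
    pairs = sorted(zip(scores, labels), key=lambda pair: pair[0])
    n = len(pairs)
    ece = 0.0
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue  # more bins than rows
        mean_score = sum(s for s, _ in chunk) / len(chunk)
        positive_rate = sum(y for _, y in chunk) / len(chunk)
        ece += (len(chunk) / n) * abs(positive_rate - mean_score)
    return ece


# Toy example (illustrative values, not project data).
toy_scores = [0.1, 0.2, 0.8, 0.9]
toy_labels = [0, 0, 1, 1]
print(round(brier_score(toy_scores, toy_labels), 3))               # 0.025
print(round(ece_equal_mass(toy_scores, toy_labels, n_bins=2), 3))  # 0.15
```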
5.2 Statistical tests
Stance: report effect sizes and CIs only. No p-values. The work characterises differences and their uncertainty rather than claiming significance.
Anchored to eval-toolkit primitives:
- bootstrap_ci: per-metric finite-sample uncertainty. See the eval-toolkit bootstrap methodology (README).
- paired_bootstrap_diff: paired comparisons across rungs on the same test set. See the eval-toolkit comparison methodology (README).
- mde_from_ci: minimum detectable effect.
- Calibration battery (reliability_curve, fit_temperature, fit_isotonic_calibrator, ECE variants, Brier). See the eval-toolkit calibration methodology (README).
- cv_clt_ci: CLT-based CI for cross-fold variance.
Cross-fold CI methodology: [LOCKED: cv_clt_ci (Bayle 2020 Annals of Statistics Theorem 3.1 implementation at eval-toolkit src/eval_toolkit/bootstrap.py:963) headline + block-bootstrap-on-folds spoke ablation + conditional stratified-k-fold-within-LODO escalation if Phase 4 compute budget permits (per ADR-024)]. cv_clt_ci operates on the 12 per-(fold, seed) metric values yielded by ADR-022’s compute-per-(fold, seed)-then-aggregate rule for rank-based metrics. Block-bootstrap-on-folds spoke ablation directly addresses the LODO non-exchangeability concern (folds are not exchangeable — each fold holds out a different positive source with different size and attack-style character). Sensitivity-check flag: if block_bootstrap_CI_halfwidth / cv_clt_CI_halfwidth > 1.5 for any rung, methodology spoke names “LODO non-exchangeability dominates within-fold variance; headline CI may understate uncertainty” (assumption A-008). Bates 2024 JASA nested-CV + Nadeau-Bengio 2003 standalone correction explicitly deferred to afterword.
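A simplified sketch of the headline interval and the sensitivity-check flag, assuming a plain normal approximation on the 12 per-(fold, seed) values. The variance estimator inside eval-toolkit's cv_clt_ci may differ from the sample standard deviation used here, and both function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def clt_interval(values, alpha=0.05):
    """Normal-approximation CI for the mean of per-(fold, seed) metric values."""
    centre = mean(values)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * stdev(values) / sqrt(len(values))
    return centre - half_width, centre + half_width


def nonexchangeability_flag(block_half_width, clt_half_width, ratio=1.5):
    """Sensitivity check from the spec: block-bootstrap half-width / CLT half-width > 1.5."""
    return block_half_width / clt_half_width > ratio


# Illustrative values only: 4 folds x 3 seeds = 12 observations for one rung.
per_fold_seed_auprc = [0.91, 0.92, 0.90, 0.84, 0.85, 0.83,
                       0.95, 0.94, 0.96, 0.88, 0.87, 0.89]
low, high = clt_interval(per_fold_seed_auprc)
print(round(low, 3), round(high, 3))
```

When the flag fires, the fold-to-fold spread is large relative to what the CLT interval assumes, which is the situation the block-bootstrap-on-folds ablation is meant to expose.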
Paired-test method: [LOCKED: eval-toolkit paired_bootstrap_diff (Efron-Tibshirani 1993 §10.3 row-level pairing) per ADR-022; DeLong + McNemar + Cochran-Q rejected at the row level with multi-source-LODO-specific rationale (DeLong's asymptotic Gaussian assumption breaks at per-fold n ≈ 4-5K benigns; designed for AUROC only; produces p-value contradicting estimation-over-testing; LODO fold-blocking violates iid assumption)].
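Row-level pairing means each resample draws one set of row indices and applies it to both rungs, so the difference is computed on the same rows. The sketch below illustrates that with a percentile interval and Brier score as the metric; it is not eval-toolkit's paired_bootstrap_diff, and the function names, the metric choice, and the toy data are assumptions.

```python
import random


def brier(labels, scores):
    """Per-row metric that stays defined on any resample."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(labels)


def paired_bootstrap_diff_sketch(metric, labels, scores_a, scores_b,
                                 n_resamples=2000, seed=1, alpha=0.05):
    """Return (point, low, high) for metric(A) - metric(B) with row-level pairing."""
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    for _ in range(n_resamples):
        # One index draw shared by both rungs: this is the pairing.
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        a = [scores_a[i] for i in idx]
        b = [scores_b[i] for i in idx]
        diffs.append(metric(y, a) - metric(y, b))
    diffs.sort()
    low = diffs[int((alpha / 2) * n_resamples)]
    high = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    point = metric(labels, scores_a) - metric(labels, scores_b)
    return point, low, high


# Toy data (illustrative): rung A is closer to the label than rung B on every row.
toy_y = [0, 1, 0, 1, 1, 0]
rung_a = [0.1, 0.9, 0.2, 0.8, 0.7, 0.3]
rung_b = [0.4, 0.6, 0.5, 0.5, 0.6, 0.4]
point, low_d, high_d = paired_bootstrap_diff_sketch(brier, toy_y, rung_a, rung_b,
                                                    n_resamples=500)
print(round(point, 4), round(low_d, 4), round(high_d, 4))
```

The output is an effect size with an interval, not a p-value, which matches the estimation-over-testing stance above.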
Multi-seed protocol (per ADR-022 + ADR-006 + ADR-016): [LOCKED: 3 seeds {42, 1337, 2025} paired across rungs; trained rungs 12 obs per rung (4 LODO folds × 3 seeds); reference rungs 4 obs per rung (4 folds × no seed dimension); trained-vs-trained pairing is row-level via paired_bootstrap_diff; trained-vs-reference pairing replicates reference scores across the 12 trained seeds (reference-side variance fold-only); rank-based metrics per-(fold, seed)-then-mean; per-row metrics pool rows across (fold, seed); recall@FPR thresholds per-(seed) from val; calibration interventions per-(rung, fold, seed); per-(rung, fold, seed) observations dumped to evals/audit/per_seed_observations.parquet per ADR-011 Guarantee 5].
Multi-comparison correction (per ADR-022 + ADR-006): [LOCKED: no formal correction applied; methodology spoke at WRITEUP/methodology.md gains "Family of comparisons" acknowledgment paragraph citing Gelman & Loken 2014 forking-paths + ASA 2016 statement on p-values]. Estimation-over-testing means correction does not apply (correction applies to significance-testing; we report effect sizes).
5.3 Operating points — detection vs verification
[LOCKED] Dual-policy framing on in-house rungs only. Reference scorers (off-the-shelf reference detectors) get recall@FPR pinpoints with explicit contamination caveats; no dual-policy framing (would imply deployment-ready operating points that don’t survive the contamination caveat).
[LOCKED: Detection = FPR ≤ 1% via eval_toolkit.TargetFPRSelector(0.01); Verification = FNR ≤ 1% (equivalently recall ≥ 99%) via eval_toolkit.TargetRecallSelector(0.99); per-(rung, fold, seed) fitting on validation only; 24 thresholds per trained rung (4 folds × 3 seeds × 2 policies) × 4 trained rungs = 96 thresholds, i.e. 48 detection/verification pairs; paired_bootstrap_op_point_diff two-level bootstrap (refit per resample) for CI propagation; cost-weighted thresholding remains rejected per ADR-006 (no CostSensitiveSelector use); per ADR-025].
Headline integration: detection-policy operating point coincides numerically with the recall@FPR=1% headline pinpoint already locked in ADR-021 — captured as a footnote on the existing R@FPR=1% column. Verification-policy operating point gains one new headline column “FPR @ recall ≥ 99%” per trained rung (see §5.1).
Spoke: full dual-policy operating-point grid (4 trained rungs × 2 policies × {pooled-IID + pooled-OOD + 4 per-LODO-fold + 5 per-OOD-slice} aggregation levels = 80 cells per policy with paired_bootstrap_op_point_diff CIs) + Verification-target reachability across trained rungs subsection (per assumption A-009; honest infeasibility reporting via asterisk + audit JSON evals/audit/verification_reachability.json) + ≥3 deployment scenarios per ADR-006 + optional Recall-floor sensitivity sweep afterword regenerating verification operating points at recall floors {95%, 99%, 99.9%} from persisted predictions per ADR-013 (zero new training compute) — all in WRITEUP/threshold-policy.md. See eval-toolkit thresholds methodology (see README) for the eval-toolkit primitive surface.
5.4 Per-source and per-style breakdowns
Required for any OOD claim — aggregate metrics hide heterogeneity. Reported alongside the headline IID/OOD numbers. Per-style heuristic tagger (regex-based) is conservative; LLM-as-rater rubric audit dropped per ADR-050 (Phase 4 cost re-estimation showed envelope ~16× original ADR-018 estimate; see EVIDENCE.md §3).
5.5 Adversarial robustness
Largely deferred — named but not exhaustively probed. The threat model (paraphrase, encoded payloads, multi-turn injection, base64/leetspeak obfuscation) is named per ADR-014; what was not tested is named explicitly in WRITEUP §5.6 and §8.
Linked ADRs: ADR-021 (eval slate aggregation + recall@FPR pinpoints), ADR-022 (statistical inference apparatus — bootstrap N + multi-comparison + multi-seed + paired-test), ADR-023 (calibration battery — raw + temperature + isotonic), ADR-024 (cross-fold CI methodology — cv_clt_ci headline + block-bootstrap-on-folds spoke), ADR-025 (dual-policy threshold characterization — symmetric 1% targets + per-(rung, fold, seed) fitting + verification-reachability audit).
6. Code architecture
The work spans three repos:
- prompt-injection-detection-prototype (this repo): modelling, covering data loading, training, classification API, and project-specific scoring code.
- eval-toolkit: evaluation harness, covering metrics, bootstrap, calibration, threshold selection, leakage detection, slice-aware orchestration, reproducibility manifests, and versioned JSON schemas.
- runpod-deploy: cloud orchestration for training/eval runs on rented GPUs. The project’s additions: prediction-persistence pull-pattern + checkpoint upload-to-HF-Hub pattern.
The split is intentional: methodology curriculum and primitives live in eval-toolkit so they survive across iterations; cloud orchestration lives in runpod-deploy so it’s reusable across projects.
6.1 Module layout (per ADR-026)
[LOCKED: concern-grouped sub-packages under src/]
src/
    data/               # loaders, dedup, LODO splits, manifest validation
    training/           # ModernBERT loader, LoRA configurator, trainer
    scoring/            # reference-scorer adapters (one module per scorer)
    eval/               # calibration_battery, operating_points, slice_analysis
    utils/              # config_hash, paths, logging glue
scripts/                # CLI entrypoints: argparse + IO; orchestrate src/ calls
configs/
    runpod/             # canonical RunPod config per ADR-020
    rungs/              # per-rung YAML hyperparameters per SPEC §5 config discipline
    profiles/           # smoke vs canonical profile configs per ADR-027
    data/               # source manifest with HF SHAs per ADR-016
tests/
    conftest.py         # marker registration + shared fixtures
    test_invariants.py  # 25+ tests-as-invariants per SPEC §5
    fixtures/           # smoke-test fixture data (NOT real data)
    unit/               # pytest -m unit
    smoke/              # pytest -m smoke
    integration/        # pytest -m integration
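The marker registration mentioned for tests/conftest.py could look roughly like the sketch below. The marker descriptions are assumptions for illustration; pytest_configure and config.addinivalue_line are the standard pytest hooks for registering markers so that `pytest -m unit` and its siblings do not emit unknown-marker warnings.

```python
# tests/conftest.py (sketch; descriptions are illustrative assumptions)

MARKERS = {
    "unit": "fast tests with no GPU and no network",
    "smoke": "end-to-end wiring check on fixture data, laptop only",
    "integration": "runs on a local GPU or cloud pod; skipped when no GPU is present",
}


def pytest_configure(config):
    """Register the project's markers with pytest at collection start."""
    for name, description in MARKERS.items():
        config.addinivalue_line("markers", f"{name}: {description}")
```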
Boundaries — src/ is library code (importable, no side effects); scripts/ is entrypoint glue (argparse + IO; not importable); configs/ is YAML data; tests/ is verification. Adding or moving a top-level src/ sub-package requires a superseding ADR.
6.2 Smoke vs canonical separation (per ADR-027)
[LOCKED: three Makefile targets stratified by execution context]
| Target | Runs on | GPU | Network | Runtime | Purpose |
|---|---|---|---|---|---|
| make smoke | laptop only | no GPU | no network | <10 min | dev debugging + reviewer “does this wire together” check |
| make test-integration | local GPU OR cloud pod | GPU when available; skip gracefully when not | optional | ~5-10 min | dev debugging on workstation GPU; pre-flight smoke on cloud pod |
| make headline-cloud | RunPod (billed) | H100/equivalent per ADR-020 gpu_order failover | required | hours; cost-cap-gated $125/job per ADR-020 + A-002 | canonical evaluation deliverable, not a test |
Honest framing (required in WRITEUP/methodology.md): math-correctness validation lives upstream in eval-toolkit (≥90% coverage floor, Hypothesis property tests, golden-output snapshots, doctests on math kernels). The local test layer in this prototype repo is debugging-grade — sufficient to catch glue-layer breakage before paying for cloud compute, not sufficient to substitute for upstream library validation. Reviewers consult eval-toolkit’s test suite for math-correctness evidence.
A separate make headline-dry-run target exposes runpod-deploy run --dry-run standalone for cost preview without provisioning.
6.3 Linked ADRs
ADR-026 (module layout), ADR-027 (smoke vs canonical), ADR-028 (coverage floor), ADR-029 (test marker strategy).
8. SDD process notes
- Spec freeze: once this document is LOCKED, changes require an ADR.
- Phase 0 interview: [LOCKED] agent reads spec, surfaces decisions, human picks, decisions become ADRs.
- Process gates, not outcome gates: phase gates check that work was done and tests pass, not that metrics hit a target. The project deliberately avoids tying phase movement to outcome numbers so that the eval reports what was found rather than what was needed to advance.
- Transcript capture: [LOCKED] every session where decisions are discussed produces a transcript in transcripts/.
- Prediction persistence: [LOCKED] per-row predictions are persisted alongside metrics. runpod-deploy pulls per-row score artifacts so downstream analyses (calibration, threshold sweeps, ROC curves) run from persisted predictions without re-running inference.
- ADR cadence: one ADR per significant decision; format per Michael Nygard.
- Assumption updates: when an assumption is invalidated mid-implementation, update assumptions.md and write a corrective ADR.
- Tests-as-invariants: every spec claim that can be made executable as a test must be.
Linked ADRs: ADR-001, ADR-025, ADR-026, ADR-027, ADR-028, ADR-029, ADR-030, ADR-031, ADR-032, ADR-033, ADR-034, ADR-035, ADR-036, ADR-037, ADR-038, ADR-039, ADR-040, ADR-041, ADR-042, ADR-043.
9. Submission deliverables (Phase 0-07)
[LOCKED] Submission deliverables locked at Phase 0-07 — see ADR-030 (deliverable format = Quarto HTML site via GH Actions; supersedes ADR-002 PDF + repo) + ADR-031 (reviewer reading paths via index.qmd + sidebar nav; supersedes ADR-004 PDF-as-hub framing) + ADR-032 (HF Hub publication = headline rungs only with model card discipline) + ADR-033 (release strategy = v0.9.0-rc1 rehearsal + v1.0.0 submission + v1.0.x post-submission patches; CHANGELOG + _site.tar.gz release asset) + ADR-034 (reproducibility tier = full ladder T0 eval-from-hub + T1 smoke + T3 headline-cloud).
Reviewer email at submission carries three URLs + private attachment:
1. Source pin: https://github.com/brandon-behring/prompt-injection-detection-prototype/tree/v1.0.0
2. Live rendered Quarto site: https://brandon-behring.github.io/prompt-injection-detection-prototype/
3. GH release page: https://github.com/brandon-behring/prompt-injection-detection-prototype/releases/tag/v1.0.0
4. Transcripts as private attachment per existing convention (gitignored).
Linked ADRs: ADR-030, ADR-031, ADR-032, ADR-033, ADR-034.