prompt-injection-detection-prototype
A methodology-focused evaluation of prompt-injection detectors under cross-family distribution shift. Asks one question: when detectors trained on direct prompt-injection examples meet attack families they didn’t see, do they still work?
Pick a guide for the full methodology; both cover the same content: WRITEUP_PAPER.md (academic IMRAD, ~20–25 min) or WRITEUP_NARRATIVE.md (narrative arc, ~15–20 min). The executive summary below is the 1-page distillation.
Executive summary
This project evaluates prompt-injection detectors under out-of-distribution (OOD) shift. Prompt injection means untrusted text trying to override an LLM system’s instructions. The project is not trying to ship a production detector; it is trying to show what a fairer evaluation says about several detector designs.
Bottom line, two-sided:
- Direct detection is learned, and learnable cheaply. TF-IDF + LR reaches 0.971 AUPRC on balanced direct+benign validation; LoRA matches at 0.974. The neural lift over the lexical baseline is small.
- Cross-family generalization fails. On pooled OOD, the best detector lands at AUPRC 0.364 against a random floor of 0.374 — at the floor, not above. Under AUROC, LoRA (0.383) and TF-IDF (0.371) both clear the 0.5 floor on the wrong side: their rankings are anti-correlated with truth on cross-family attacks. The frozen ModernBERT probe alone stays just above floor (AUROC 0.515, 95% CI [0.505, 0.525] — lower bound clears 0.5 by only 0.005).
Pooled OOD AUPRC table
| Detector | Pooled OOD AUPRC | Read |
|---|---|---|
| ModernBERT frozen probe | 0.364 [0.354, 0.375] | best in-house score, still at random floor |
| ProtectAI v1* | 0.361 [0.330, 0.391] | reference scorer with verified training-pool overlap; not a clean OOD baseline |
| ProtectAI v2* | 0.314 [0.283, 0.345] | reference scorer with verified training-pool overlap; does not dominate v1 |
| ModernBERT LoRA | 0.293 [0.286, 0.301] | trained adapter ranks below random; AUROC 0.383 below 0.5 floor |
| TF-IDF + LR | 0.291 [0.283, 0.298] | classical floor; AUROC 0.371 also below 0.5 floor |
* ProtectAI v1 + v2 were trained on at least 2 of 4 LODO training-pool sources (deepset/prompt-injections, Lakera/gandalf_ignore_instructions) per EVIDENCE §1-2. Pooled OOD scores on slices that overlap with that training pool are not clean OOD baselines.
For pooled_ood, random AUPRC is 412 / 1101 = 0.374. The 0.364 frozen-probe score is therefore not a success claim. Under AUROC (random floor 0.5), LoRA and TF-IDF + LR both land below the floor with CIs that clear 0.5 on the wrong side — the mechanism is lexical overfitting + a label-relevance shift on the OOD slate (see §Mechanism below).
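The floor arithmetic can be checked in a few lines. This is a minimal sketch, not code from the repository: the function name `random_auprc_floor` is made up for illustration, and the only inputs are the counts and the frozen-probe score quoted above.

```python
def random_auprc_floor(n_positive: int, n_total: int) -> float:
    """Expected AUPRC of a scorer that ranks at random: the positive rate."""
    if n_total <= 0 or not 0 <= n_positive <= n_total:
        raise ValueError("need n_total > 0 and 0 <= n_positive <= n_total")
    return n_positive / n_total


# Counts and score quoted in this README for the pooled OOD slate.
floor = random_auprc_floor(412, 1101)
best_auprc = 0.364  # ModernBERT frozen probe, pooled OOD AUPRC

print(f"floor={floor:.3f}  best={best_auprc:.3f}  gap={best_auprc - floor:+.3f}")
```

A negative gap means the best detector does not beat a random ranking on this slate, which is why the 0.364 figure is reported as a floor-level result.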
Mechanism: lexical overfitting + label-relevance shift
Two detectors land below the 0.5 AUROC random floor on pooled OOD with CIs that clear 0.5 on the wrong side: LoRA at 0.383 [0.374, 0.392] and TF-IDF + LR at 0.371 [0.362, 0.381]. A score below 0.5 AUROC is not pure overfitting (which predicts collapse toward random, not past it); it is lexical overfitting combined with a label-relevance shift on this specific slate:
- LoRA + TF-IDF both learn lexical signatures of direct injection (“ignore previous instructions”, “you are now”, etc.).
- NotInject (n=339, all negative): benign text engineered to look like direct injection. Both detectors score these HIGH (false positives).
- BIPIA + InjecAgent (indirect + agentic, n=112): real attacks that do not use direct-injection lexical patterns. Both detectors score these LOW (false negatives).
The lexical signal is real and consistent within itself — it just stops tracking attack class on cross-family slices where the lexical and semantic labels diverge. The frozen ModernBERT probe (zero LODO-pool adaptation) stays at 0.515 AUROC; generic linguistic features are less aligned with the direct-injection lexical distribution and therefore less inverted on the cross-family slate.
Generalization gap: in-pool 0.99 AUROC → cross-family 0.38 AUROC, ~0.6 drop for the trained detectors; frozen probe’s gap is 0.91 → 0.515, ~0.4 drop. The more training adapted to the LODO pool, the harder the cross-family fall.
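To see how a ranking can fall below 0.5 AUROC rather than merely collapse to it, here is a toy illustration. The scores are synthetic and hand-picked, not project data, and `auroc` is a small pairwise implementation written for this sketch, not the project's metrics pipeline.

```python
def auroc(labels, scores):
    """AUROC as P(random positive outscores random negative); ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC is undefined when only one class is present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Synthetic (label, lexical_score) pairs; 1 = attack, 0 = benign. NOT project data.
in_family = [(1, 0.95), (1, 0.90), (1, 0.85), (0, 0.10), (0, 0.05), (0, 0.15)]
cross_family = [
    (0, 0.90), (0, 0.85), (0, 0.80),  # benign text shaped like an injection: scored high
    (1, 0.10), (1, 0.15), (1, 0.20),  # attacks without the trigger phrases: scored low
    (0, 0.05), (1, 0.95),             # one ordinary benign, one direct-style attack
]

for name, rows in [("in-family", in_family), ("cross-family", cross_family)]:
    labels, scores = zip(*rows)
    print(f"{name}: AUROC = {auroc(labels, scores):.4f}")
```

The same scoring rule separates the in-family rows perfectly and ranks the cross-family rows worse than chance, because the high-scoring rows are now mostly benign and the low-scoring rows are mostly attacks.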
Direct detection check
The OOD result should not be read as “nothing worked.” The detectors learned the direct prompt-injection task; they then failed to generalize cleanly across attack families.
| Detector | Direct+benign validation AUPRC | AUROC | Recall@0.5 |
|---|---|---|---|
| ModernBERT LoRA | 0.974 | 0.993 | 0.934 |
| TF-IDF + LR | 0.971 | 0.992 | 0.930 |
| ModernBERT frozen probe | 0.653 | 0.907 | 0.849 |
| Detector | LODO held-out direct-source recall@0.5 |
|---|---|
| ModernBERT frozen probe | 0.641 |
| ModernBERT LoRA | 0.625 |
| ModernBERT full fine-tune** | 0.558 |
** Full-FT shows LODO direct-source data only (24 Phase 2 predictions persisted); the comparable pooled OOD inference was not run (Phase 5 X11 crash, see ADR-075). Full-FT is absent from the Pooled OOD table above for that reason.
This is a capability characterization, not a deployment recommendation. The artifact’s contribution is the honest evaluation harness plus the negative result on cross-family transfer.
→ Continue with the academic paper (~20–25 min) or the narrative (~15–20 min). Both cover the same methodology, findings, and limitations in different reading styles.
Pick a guide for the full methodology
Pick the format that fits how you want to read this:
- Academic paper format (IMRAD) → WRITEUP_PAPER.md — formal Abstract / Introduction / Methods / Results / Discussion / Limits / Conclusion / References (~20–25 min)
- Narrative format (story) → WRITEUP_NARRATIVE.md — plain-English first-person 5-act story arc (~15–20 min)
- 60-second tour → Project at a glance
- Just the data → RESULTS.md — exact tables + 5 canonical figures + raw artifact pointers
- Reproduce → T0 laptop / T1 smoke / T3 cloud tier ladder (~$0 / ~$0 / ~$125)
Both guides cover the same content; the style is the difference. Jargon is defined on first use in either guide and cross-referenced to docs/GLOSSARY.md.
Below the fold — what was tested, why trust the result, reproduction quickstart, repo map
What this project is
- A capability characterization, not a deployment recommendation.
- A detector-ladder evaluation: TF-IDF + LR, ModernBERT frozen probe, ModernBERT LoRA, and ProtectAI v1/v2 references. (Full fine-tune ran for LODO direct-source only; pooled OOD inference crashed — see ADR-075.)
- A held-out family evaluation: indirect injection, agentic-flow injection, jailbreak-style questions, and benign-but-injection-shaped text.
- A reproducible artifact: source-disjoint splits, leakage checks, persisted predictions, confidence intervals, calibration metrics, and Quarto-rendered documentation.
What “OOD” means here
“OOD” = cross-family, not just a new source name. The training pool is direct-injection-heavy (4 sources: deepset, Gandalf, Mosscap, HackAPrompt). The held-out OOD slate covers:
- Indirect injection (BIPIA) — payload arriving through document context
- Agentic-flow injection (InjecAgent) — payload split across tool-use turns
- Jailbreak-as-question (JBB-Behaviors, XSTest) — harmful elicitation framed as questions
- Benign-but-injection-shaped (NotInject) — false-positive robustness
That mismatch is the experiment.
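The two kinds of hold-out can be made concrete with a short sketch. It uses only the source names listed above; `lodo_folds` is an illustrative helper, not the repository's split code.

```python
TRAIN_POOL = ["deepset", "Gandalf", "Mosscap", "HackAPrompt"]
OOD_SLATE = ["BIPIA", "InjecAgent", "JBB-Behaviors", "XSTest", "NotInject"]


def lodo_folds(sources):
    """Yield (train_sources, held_out_source), holding out each source once."""
    for held_out in sources:
        train = [s for s in sources if s != held_out]
        yield train, held_out


folds = list(lodo_folds(TRAIN_POOL))
for train, held_out in folds:
    # The held-out source never appears in its own fold's training set.
    assert held_out not in train
    # No OOD-slate source is trained on in any fold.
    assert not set(train) & set(OOD_SLATE)
    print(f"train={train}  held_out={held_out}")
```

LODO measures transfer to an unseen direct-injection source; the OOD slate measures transfer to unseen attack families. The second shift is the larger one.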
Why trust the result
- Held-out at the source level (LODO), not just the row level. Leave-one-dataset-out splits ensure the test slate doesn’t share a source with training. evals/leakage_report.json reports zero exact-hash overlaps and zero cosine-≥0.85 overlaps across all (train, val, test) per-fold-seed pairs.
- Every reported number carries a 95% bootstrap CI. Effect-size + CI reporting, not p-values. BCa bootstrap with 10000 resamples; seed-stability check at second seed.
- Honest single-class slice handling. BIPIA + InjecAgent (all-positive) and NotInject (all-negative) have mathematically undefined AUPRC/AUROC; the metrics pipeline filters them at source. Recall-at-threshold is reported on those slices instead.
- Reference scorer contamination is audited, not assumed. EVIDENCE §1-2 documents that ProtectAI v1 + v2 were trained on ≥2 of 4 LODO training-pool sources; their pooled OOD scores on overlapping slices are upper-bound, not clean OOD.
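The CI and single-class points above can be sketched together. Note the swap: the project reports BCa intervals, while this sketch uses the simpler percentile bootstrap to stay dependency-free. All function names are illustrative and the data is synthetic.

```python
import random


def auroc(labels, scores):
    """Return AUROC, or None when the slice has only one class (metric undefined)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def recall_at(labels, scores, threshold=0.5):
    """Fraction of positives scored at or above the threshold; None if no positives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    return sum(s >= threshold for s in pos) / len(pos) if pos else None


def percentile_ci(labels, scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for AUROC; resamples that lose a class are skipped."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        value = auroc([labels[i] for i in idx], [scores[i] for i in idx])
        if value is not None:
            stats.append(value)
    if not stats:
        return None
    stats.sort()
    low = stats[int((alpha / 2) * (len(stats) - 1))]
    high = stats[int((1 - alpha / 2) * (len(stats) - 1))]
    return low, high


# Synthetic two-class slice (NOT project data): positives score a little higher.
rng = random.Random(42)
labels = [1] * 30 + [0] * 30
scores = [rng.gauss(0.6, 0.2) for _ in range(30)] + [rng.gauss(0.4, 0.2) for _ in range(30)]
low, high = percentile_ci(labels, scores)
print(f"AUROC {auroc(labels, scores):.3f}, 95% CI [{low:.3f}, {high:.3f}]")

# All-positive slice (the BIPIA / InjecAgent situation): AUROC undefined, recall is not.
all_pos_labels, all_pos_scores = [1, 1, 1, 1], [0.9, 0.2, 0.7, 0.4]
print(auroc(all_pos_labels, all_pos_scores), recall_at(all_pos_labels, all_pos_scores))
```

Returning `None` for a single-class slice, instead of a placeholder number, keeps undefined metrics out of any downstream average.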
Reproduce — three tiers
git clone https://github.com/brandon-behring/prompt-injection-detection-prototype
cd prompt-injection-detection-prototype
make install
# T1 — laptop smoke (~$0, <10 min)
make test-smoke
# T0 — score-match against published HF Hub checkpoints (~$0, ~20 min)
make eval-from-hub RUNG=frozen-probe
make eval-from-hub RUNG=lora
# Score-matches against evals/results.json within 1e-4 absolute tolerance per ADR-058.
# T3 — full retraining from scratch on cloud GPU (~$125, hours)
make headline-cloud # cost-capped per ADR-020
HF Hub checkpoints: BBehring/prompt-injection-frozen-probe · BBehring/prompt-injection-lora
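The T0 comment mentions a score match within 1e-4 absolute tolerance. A check of that kind could look like the sketch below. The layout of evals/results.json is not shown in this README, so the flat metric dictionaries, the key names, and the "reproduced" values are all hypothetical.

```python
import math


def score_mismatches(reference, reproduced, abs_tol=1e-4):
    """Return {metric: (reference, reproduced)} for metrics missing or outside tolerance."""
    mismatches = {}
    for name, ref_value in reference.items():
        new_value = reproduced.get(name)
        if new_value is None or not math.isclose(
            ref_value, new_value, rel_tol=0.0, abs_tol=abs_tol
        ):
            mismatches[name] = (ref_value, new_value)
    return mismatches


# Hypothetical flat metric dicts; the key names and reproduced values are made up.
reference = {"pooled_ood_auprc": 0.364, "pooled_ood_auroc": 0.515}
reproduced = {"pooled_ood_auprc": 0.36404, "pooled_ood_auroc": 0.5153}

mismatches = score_mismatches(reference, reproduced)
print(mismatches or "all metrics within tolerance")
```

Setting `rel_tol=0.0` makes the comparison purely absolute, which matches the wording of the tolerance. `math.isclose` defaults to a relative tolerance otherwise.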
Other useful targets:
make site # render Quarto site locally
make audit # regenerate/check ADR-derived submission audit
make render-figures # render canonical F1-F5 figures from evals/
How this project thinks
- Spec-driven development: 81 immutable Architecture Decision Records under decisions/ lock methodology choices before code lands.
- Library-first invariant: shared evaluation primitives live in upstream libraries (eval-toolkit, runpod-deploy, research_toolkit); local code is project-specific glue. Upstream gaps land in decisions/upstream_issues.md before any local workaround.
- Confound-control discipline: when the headline result raised the natural follow-up (“does a longer context window fix the OOD gap?”), a controlled DeBERTa-v3-base ablation was designed (chunk-and-average vs head-truncation). Result: a publishable null. See ADR-060.
Repository map
| Path | Contents |
|---|---|
| index.qmd | first-reader landing page |
| WRITEUP_PAPER.md | academic IMRAD article (~20–25 min) |
| WRITEUP_NARRATIVE.md | narrative story-arc article (~15–20 min) |
| RESULTS.md | exact tables, 5 canonical figures, raw artifact pointers |
| WRITEUP.md | 1-page router pointing at the two guides |
| WRITEUP/ | 8 detailed methodology spokes (deep-dive references) |
| EVIDENCE.md | external-evidence audit trail |
| NEXT_STEPS.md | future-work surface |
| decisions/ | 81 ADRs documenting methodology + governance |
| evals/ | metrics, bootstrap CIs, operating points, per-row predictions |
| docs/plots/ | F1-F5 figures + metadata sidecars (provenance trail) |
| docs/GLOSSARY.md | term definitions referenced by both guides |
| notebooks/ | static-rendered Jupytext appendices |
| src/, scripts/, tests/ | implementation + verification |
Key terms (quick reference)
- AUPRC — primary ranking metric; random floor equals the positive rate.
- OOD — out-of-distribution. Here the important shift is cross-family, not just a different source name.
- LODO — leave-one-dataset-out. One source is held out while the detector is trained on the others.
- FPR — false-positive rate. A 1% FPR target means no more than one false alarm per 100 benign examples.
- ECE/Brier — calibration errors; lower is better.
More definitions in docs/GLOSSARY.md.
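As a worked example of the FPR definition, the sketch below picks a threshold from benign scores alone so that at most 1% of them are flagged. The helper name and the evenly spaced scores are illustrative assumptions. This is not a deployment threshold and not the project's operating-point code.

```python
import math


def threshold_for_fpr(benign_scores, target_fpr=0.01):
    """Lowest threshold t where flagging `score > t` keeps FPR <= target_fpr."""
    ranked = sorted(benign_scores, reverse=True)
    allowed = math.floor(target_fpr * len(ranked))  # false alarms we can afford
    # Flagging strictly above the (allowed + 1)-th highest score gives at most `allowed` alarms.
    return ranked[allowed] if allowed < len(ranked) else float("-inf")


# Synthetic benign scores (NOT project data): 200 evenly spaced values in [0, 1).
benign = [i / 200 for i in range(200)]
t = threshold_for_fpr(benign, target_fpr=0.01)
false_alarms = sum(s > t for s in benign)

print(f"threshold={t:.3f}  false_alarms={false_alarms}  fpr={false_alarms / len(benign):.3f}")
```

With 200 benign examples, a 1% target allows two false alarms. Because the allowance is rounded down, the threshold errs toward fewer alarms.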
What it does not claim
Single-turn English text classification only. Not in scope: multilingual attacks, encoded payloads (base64/leetspeak/Unicode confusables), paraphrase robustness, adversarial perturbations, full multi-turn system behavior, deployment threshold recommendations. The reference-scorer audit §5.6 names the threat-model deferrals explicitly.
Submission anchors
- Current state: tree/v1.3.13 (2026-05-26), live-site source
- Original submission tag: tree/v1.0.0 (2026-05-18), preserved as historical reviewer pin per ADR-033
- Live rendered site: https://brandon-behring.github.io/prompt-injection-detection-prototype/
- HF Hub checkpoints: frozen probe, LoRA
