Prompt-injection classifier: what the evaluation found
Prompt injection is untrusted text that tries to override the instructions an LLM system is supposed to follow. This project asks one narrow question: when several simple prompt-injection detectors are trained on direct instruction-override examples, can they detect that pattern, and do they still work when the attack family changes?
The short answer is: direct detection works better; cross-family generalization fails. Two trained detectors land below the 0.5 AUROC random floor on cross-family OOD — their rankings are anti-correlated with truth on attack classes they were not trained on. Only the frozen ModernBERT probe stays above the floor.
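To make "below the 0.5 floor" concrete: AUROC is the probability that a randomly chosen attack outscores a randomly chosen benign input, so a value under 0.5 means the detector ranks benign inputs above attacks more often than not, and negating its scores would yield 1 minus that AUROC. The sketch below illustrates this with a pairwise AUROC in plain Python. The labels and scores are invented toy values, not project data, and the `auroc` helper is a hypothetical name, not a function from this repository.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. This is the pairwise (Mann-Whitney) form of AUROC.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only (1 = attack, 0 = benign).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.2, 0.4, 0.7, 0.3, 0.6, 0.9]

original = auroc(labels, scores)               # below 0.5: ranking is inverted
flipped = auroc(labels, [-s for s in scores])  # negated scores mirror it above 0.5

print(f"original={original:.3f} flipped={flipped:.3f}")
```

A detector in this regime is not merely uninformative; it carries signal pointing the wrong way on the unseen attack families.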
This page is the 60-second landing. Pick how you want to read the rest:
Pick your reading style
- As a journal paper (academic IMRAD) → WRITEUP_PAPER.md — Abstract, Introduction, Methods, Results, Discussion, Limitations, Conclusion, References. Formal voice; technical terminology with on-first-use definitions. ~20–25 min.
- As a story (narrative arc) → WRITEUP_NARRATIVE.md — Hook, Setup, Investigation, Revelation, Implications, Epilogue. Plain-English first-person prose. ~15–20 min.
- As a hiring manager (60 seconds) → Project at a glance — 4 questions: what problem, what found, why trust, how the candidate thinks.
- As a reproducer → WRITEUP/reproducibility.md: the T0/T1/T3 tier ladder (~$0 / ~$0 / ~$125).
- Just the data → RESULTS.md — exact tables + 5 canonical figures + raw artifact pointers, no narrative prose.
The README has the 1-page executive summary including headline numbers + mechanism + direct-detection check tables — see README.md#executive-summary.
Both reading-style guides cover the same content (problem, methodology, all 7 findings, mechanism interpretation, limitations); the register and pacing differ. Technical terms are defined on first use in either guide and cross-referenced to docs/GLOSSARY.md.
Headline pooled OOD AUPRC
| Detector | Pooled OOD AUPRC | 95% CI | Read |
|---|---|---|---|
| ModernBERT frozen probe | 0.364 | [0.354, 0.375] | best in-house score, still at random floor (0.374) |
| ProtectAI v1* | 0.361 | [0.330, 0.391] | reference scorer with verified training-pool overlap; not a clean OOD baseline |
| ProtectAI v2* | 0.314 | [0.283, 0.345] | reference scorer with verified training-pool overlap; does not dominate v1 |
| ModernBERT LoRA | 0.293 | [0.286, 0.301] | trained adapter ranks below random; AUROC 0.383 below 0.5 floor |
| TF-IDF + LR | 0.291 | [0.283, 0.298] | classical floor; AUROC 0.371 also below 0.5 floor |
* ProtectAI v1 + v2 were trained on at least 2 of 4 LODO training-pool sources (deepset/prompt-injections, Lakera/gandalf_ignore_instructions) per EVIDENCE §1-2. Pooled OOD scores on slices that overlap with that training pool are not clean OOD baselines.
Random AUPRC on the pooled OOD slate is the positive rate, 412/1101 = 0.374. No detector clearly beats that floor. Under AUROC (0.5 random floor), the two trained detectors (ModernBERT LoRA and TF-IDF + LR) land below 0.5, with confidence intervals on the wrong side of it. That is the headline finding.
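The floor arithmetic and the "no detector clearly beats it" reading can be checked with a few lines of Python. This is a minimal sketch using only the numbers printed in the table above. The three-way `read_against_floor` labels are a simplification introduced here for illustration, not the project's evaluation code, and comparing an interval against the floor is an informal reading rather than a formal test.

```python
# Random AUPRC equals the positive rate of the evaluation slate.
POSITIVES, TOTAL = 412, 1101
random_auprc = POSITIVES / TOTAL

# (AUPRC, CI low, CI high), copied from the headline table.
table = {
    "ModernBERT frozen probe": (0.364, 0.354, 0.375),
    "ProtectAI v1": (0.361, 0.330, 0.391),
    "ProtectAI v2": (0.314, 0.283, 0.345),
    "ModernBERT LoRA": (0.293, 0.286, 0.301),
    "TF-IDF + LR": (0.291, 0.283, 0.298),
}


def read_against_floor(ci_low, ci_high, floor):
    """Hypothetical helper: place a confidence interval relative to the floor."""
    if ci_low > floor:
        return "above"
    if ci_high < floor:
        return "below"
    return "overlaps"


verdicts = {
    name: read_against_floor(lo, hi, random_auprc)
    for name, (_, lo, hi) in table.items()
}

print(f"random AUPRC floor = {random_auprc:.3f}")
for name, verdict in verdicts.items():
    print(f"{name}: {verdict}")
```

No interval lies entirely above the floor, which matches the statement that no detector clearly beats random on pooled OOD AUPRC.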
For the full executive summary (mechanism + direct-detection check tables + reproduction instructions), see README.md. For the same content as a paper or a story, pick a guide above.
Submission anchors
- Current state: tree/v1.3.13 (2026-05-26), live-site source
- Original submission tag: tree/v1.0.0 (2026-05-18), preserved as historical reviewer pin per ADR-033
- HF Hub checkpoints: frozen probe, LoRA