Prompt-injection classifier: what the evaluation found

60-second landing page for the prompt-injection classifier evaluation, with a chooser between academic and narrative reading formats.
Author

Brandon Behring

Published

May 26, 2026

Prompt injection is untrusted text that tries to override the instructions an LLM system is supposed to follow. This project asks one narrow question: when several simple prompt-injection detectors are trained on direct instruction-override examples, can they detect that pattern, and do they still work when the attack family changes?

The short answer: direct detection works better; cross-family generalization fails. On cross-family out-of-distribution (OOD) data, two trained detectors land below the 0.5 AUROC random floor: their rankings are anti-correlated with the truth on attack classes they were not trained on. Only the frozen ModernBERT probe stays above the floor.
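To make "below the 0.5 AUROC floor" concrete, here is a minimal sketch of AUROC as a pairwise ranking probability. The toy labels and scores are hypothetical and are not taken from the evaluation; the `auroc` helper is an illustrative implementation, not the project's code. It shows that a detector scoring below 0.5 ranks attacks lower than benign text more often than not, and that negating its scores mirrors the result to `1 - AUROC`.

```python
def auroc(scores, labels):
    """AUROC as the probability that a random positive outscores a random
    negative, with ties counted as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = injection, 0 = benign.
# The scorer mostly gives benign text the higher scores.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.2, 0.4, 0.1, 0.9, 0.3, 0.8, 0.7]

original = auroc(scores, labels)               # below 0.5: anti-correlated ranking
flipped = auroc([-s for s in scores], labels)  # negated scores mirror it above 0.5

print(f"original AUROC = {original:.3f}")
print(f"flipped  AUROC = {flipped:.3f}")
```

An AUROC of exactly 0.5 carries no ranking signal. A value below 0.5 carries signal pointing the wrong way, which is why it reads as a generalization failure rather than as noise.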

This page is the 60-second landing. Pick how you want to read the rest:


Pick your reading style

  • As a journal paper (academic IMRAD): WRITEUP_PAPER.md. Abstract, Introduction, Methods, Results, Discussion, Limitations, Conclusion, References. Formal voice; technical terminology with on-first-use definitions. ~20–25 min.
  • As a story (narrative arc): WRITEUP_NARRATIVE.md. Hook, Setup, Investigation, Revelation, Implications, Epilogue. Plain-English first-person prose. ~15–20 min.
  • As a hiring manager (60 seconds): Project at a glance. 4 questions: what problem, what found, why trust, how the candidate thinks.
  • As a reproducer: WRITEUP/reproducibility.md
  • Just the data: RESULTS.md. Exact tables + 5 canonical figures + raw artifact pointers, no narrative prose.

The README has the 1-page executive summary, including headline numbers, mechanism, and direct-detection check tables; see README.md#executive-summary.

Both reading-style guides cover the same content (problem, methodology, all 7 findings, mechanism interpretation, limitations); the register and pacing differ. Technical terms are defined on first use in either guide and cross-referenced to docs/GLOSSARY.md.


Headline pooled OOD AUPRC

| Detector | Pooled OOD AUPRC | 95% CI | Read |
|---|---|---|---|
| ModernBERT frozen probe | 0.364 | [0.354, 0.375] | best in-house score, still at random floor (0.374) |
| ProtectAI v1* | 0.361 | [0.330, 0.391] | reference scorer with verified training-pool overlap; not a clean OOD baseline |
| ProtectAI v2* | 0.314 | [0.283, 0.345] | reference scorer with verified training-pool overlap; does not dominate v1 |
| ModernBERT LoRA | 0.293 | [0.286, 0.301] | trained adapter ranks below random; AUROC 0.383 below 0.5 floor |
| TF-IDF + LR | 0.291 | [0.283, 0.298] | classical floor; AUROC 0.371 also below 0.5 floor |

* ProtectAI v1 + v2 were trained on at least 2 of 4 LODO training-pool sources (deepset/prompt-injections, Lakera/gandalf_ignore_instructions) per EVIDENCE §1-2. Pooled OOD scores on slices that overlap with that training pool are not clean OOD baselines.

Random AUPRC on the pooled OOD slate is the positive-class prevalence, 412/1101 = 0.374. No detector clearly beats that floor. Under AUROC (0.5 random floor), the two trained detectors (ModernBERT LoRA and TF-IDF + LR) land below 0.5, with confidence intervals entirely on the wrong side of the floor. That is the headline finding.
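The floor arithmetic and the "no detector clearly beats it" reading can be checked directly from the numbers on this page. The sketch below assumes that 412 is the count of positive (attack) examples among the 1101 pooled OOD examples, so that a random scorer's expected precision equals that fraction. The `read_against_floor` helper and its three verdict labels are illustrative and are not part of the project's code.

```python
# Counts and intervals copied from the text and table on this page.
N_POSITIVE = 412   # assumed: attack examples in the pooled OOD slate
N_TOTAL = 1101     # all examples in the pooled OOD slate

random_auprc = N_POSITIVE / N_TOTAL  # expected AUPRC of a random scorer

# name -> (point estimate, CI lower bound, CI upper bound)
detectors = {
    "ModernBERT frozen probe": (0.364, 0.354, 0.375),
    "ProtectAI v1": (0.361, 0.330, 0.391),
    "ProtectAI v2": (0.314, 0.283, 0.345),
    "ModernBERT LoRA": (0.293, 0.286, 0.301),
    "TF-IDF + LR": (0.291, 0.283, 0.298),
}


def read_against_floor(lo, hi, floor):
    """Classify a confidence interval relative to the random floor."""
    if hi < floor:
        return "below floor"
    if lo > floor:
        return "above floor"
    return "overlaps floor"


verdicts = {
    name: read_against_floor(lo, hi, random_auprc)
    for name, (_, lo, hi) in detectors.items()
}

print(f"random AUPRC floor = {random_auprc:.3f}")
for name, verdict in verdicts.items():
    print(f"{name}: {verdict}")
```

Under this reading, no interval sits entirely above the floor, which matches the statement that no detector clearly beats random on pooled OOD AUPRC.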

For the full executive summary (mechanism + direct-detection check tables + reproduction instructions), see README.md. For the same content as a paper or a story, pick a guide above.


Submission anchors

  • Current state: tree/v1.3.13 (2026-05-26) — live-site source
  • Original submission tag: tree/v1.0.0 (2026-05-18) — preserved as historical reviewer pin per ADR-033
  • HF Hub checkpoints: frozen probe, LoRA