Soft-signals naming discipline and external artifact engagement set

Published

May 15, 2026

ADR-012: Soft-signals naming discipline and external artifact engagement set

Status

Accepted (2026-05-15)

Context

Soft signals (Q5-C9) are reviewer-weighted hints the brief carries without literally mandating. Naming them in the writeup demonstrates careful reading. External artifact engagement (Q5-C10) is the analogous question for prior work: which papers, benchmarks, models does the submission engage with, and at what level? Failing to engage with cited or canonical artifacts reads as “didn’t do the homework”; over-engaging dilutes the methodology focus.

Both rows are brief-derived. The user verbally summarized that the brief is consistent with a methodology-first submission and ratified the default soft-signal list + default engagement set. This ADR records that ratification.

Decision

Part 1 — Eight soft signals, each named in the WRITEUP

Soft signal Where named Aligned by ADR
Calibration Headline metrics narrative (ECE is in the 4-metric headline) ADR-006
OOD honesty Data design narrative (source-disjoint LODO + NotInject inclusion) ADR-008
Reproducibility Process narrative (two-tier laptop + GPU canonical) ADR-009
Writing clarity Front-matter (hub-and-spoke structure made explicit) ADR-004
Engineering taste Process narrative (marker tests, uv.lock, pre-commit) ADR-009
Methodology > results Front-matter (Principle 1 cited from ADR-005) ADR-005
Time-budgeted craftsmanship Limitations narrative (fallback ladder discussion) ADR-001 + ADR-005
Honesty about limitations Limitations narrative (structured-limitations principle) ADR-005 Principle 3 + ADR-010

Naming discipline: at each relevant WRITEUP section, the prose explicitly cites the signal: “the brief emphasizes [X], so we [methodology-choice-Y]”. Shows the brief was read carefully and the methodology choices respond to it.

Part 2 — Default external artifact engagement set

Artifact Engagement level Implementation
Lakera Guard / ProtectAI LLM-Guard Compare against Reference-scorer rungs in the rung slate (per ADR-007); finalize model IDs in Phase 0-03 contingent on time.
JailbreakBench (Chao et al. 2024, NeurIPS D&B, arXiv:2404.01318) Cite + acknowledge Methodology section; not used as primary eval (different task: red-team vs detector).
HarmBench (Mazeika et al. 2024, ICML, arXiv:2402.04249) Cite + acknowledge Methodology section; same rationale as JailbreakBench.
InjecGuard / NotInject (Li & Liu 2024, arXiv:2410.22770) Replicate NotInject benign-trigger hard negatives included in OOD slate (per ADR-008); over-defense framing surfaced in WRITEUP.
BIPIA (Yi et al. 2023, arXiv:2312.14197) Compare against Indirect-injection OOD slice (per ADR-010 Bound 2: direct + indirect in scope).
InjecAgent (Zhan et al. 2024, ACL Findings, arXiv:2403.02691) Cite + acknowledge OOD stretch probe; agentic injection is out-of-primary-scope per ADR-010 Bound 2.
PromptShield (Microsoft 2024, provisional arXiv:2405.14478) Cite + acknowledge Influence on Recall@FPR=1% pinpoint choice (per ADR-006); cited in metrics-choice rationale.
OWASP LLM Top 10 Cite + acknowledge (conditional) Cite if industry-standard threat-model framing is relevant; otherwise omit.

Consequences

Positive:

  • Soft-signal naming discipline produces a writeup that reads as carefully-targeted-at-the-brief rather than generic-ML-submission. Differentiates from equivalently-rigorous submissions that don’t surface the brief-reading.
  • Default engagement set is dossier-grounded — every external artifact mapped here is in docs/research/MANIFEST.json and has a verified summary in the dossier files.
  • Engagement levels are concrete: “compare against” maps to rungs; “replicate” maps to OOD slices; “cite + acknowledge” maps to references. No ambiguity.

Negative / cost:

  • Eight soft signals × one explicit naming each = eight prose hooks to write into the WRITEUP. Bounded.
  • Default engagement set risks over-citing if the brief doesn’t actually emphasize all listed artifacts. Mitigation: the engagement levels are calibrated (most are “cite + acknowledge”, which is cheap); only Lakera/ProtectAI, NotInject, and BIPIA carry implementation cost (already locked in ADR-007/008/010).

Neutral:

  • If during Phase 1-4 the brief surfaces additional soft signals or cited artifacts that the user didn’t recall in the verbal summary, a superseding ADR would record the update.

Alternatives Considered

  • Leave soft signals implicit (in methodology choices but not named in the writeup): Cheaper. Rejected because explicit naming is the methodology-first move; differentiator for an A2 reviewer.
  • Smaller engagement set (e.g., only Lakera + JailbreakBench): Smaller paperwork. Rejected because the dossier-grounded artifacts each correspond to a load-bearing comparison (over-defense via NotInject; indirect-injection via BIPIA; metrics-influence via PromptShield) — dropping any narrows the methodology story.
  • Speculative inclusion of OWASP LLM Top 10 / NIST AI RMF unconditionally: Demonstrates industry-standard awareness. Rejected because over-claiming without brief signal is the failure mode the soft-signal discipline guards against. Conditional inclusion is the right stance.

References

  • JailbreakBench (Chao et al. 2024 NeurIPS D&B) — https://arxiv.org/abs/2404.01318
  • HarmBench (Mazeika et al. 2024 ICML) — https://arxiv.org/abs/2402.04249
  • InjecAgent (Zhan et al. 2024 ACL) — https://arxiv.org/abs/2403.02691
  • InjecGuard / NotInject (Li & Liu 2024) — https://arxiv.org/abs/2410.22770
  • BIPIA (Yi et al. 2023) — https://arxiv.org/abs/2312.14197
  • PromptShield (Microsoft 2024 provisional) — https://arxiv.org/abs/2405.14478
  • HackAPrompt (Schulhoff et al. 2023 EMNLP) — https://arxiv.org/abs/2311.16119
  • OWASP LLM Top 10 — https://owasp.org/www-project-top-10-for-large-language-model-applications/
  • docs/research/MANIFEST.json — dossier index
  • ADR-001 (time-budgeted craftsmanship — fallback ladder)
  • ADR-005 (methodology > results; honesty about limitations)
  • ADR-006 (calibration in headlines; PromptShield influence)
  • ADR-007 (Lakera/ProtectAI reference rungs)
  • ADR-008 (OOD honesty via LODO + NotInject; BIPIA in OOD slate)
  • ADR-009 (reproducibility + engineering taste)
  • ADR-010 (honesty about limitations via structured extension conditions)