Hyperparameter disclosure
Reviewers may suspect cherry-picking. This document is the anti-cherry-pick defense: every hyperparameter value is locked before training begins (per SPEC_GREENFIELD.md §2 locked-process rule + ADR-019), and every non-default choice carries either a locking ADR or a literature-inheritance source. No val-set gridsearch ran during this submission.
The four sections below cover: the locked recipe (§1), what was actually swept (§2), what was deliberately not swept (§3), and budget-dependence caveats (§4).
1. Seed recipe (locked values)
All values are read at runtime from configs/rungs/*.yaml; mismatches between code defaults and YAML fail loud per ADR-026.
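
The fail-loud check described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the names `ConfigMismatchError` and `assert_matches_defaults` are hypothetical, and the YAML file is assumed to have already been parsed into a plain dictionary.

```python
from typing import Any, Mapping


class ConfigMismatchError(ValueError):
    """Raised when a YAML-provided value disagrees with a code default (hypothetical name)."""


def assert_matches_defaults(
    yaml_values: Mapping[str, Any], code_defaults: Mapping[str, Any]
) -> None:
    """Raise if any key is missing on either side or the two values differ."""
    problems = []
    for key in sorted(set(yaml_values) | set(code_defaults)):
        if key not in yaml_values:
            problems.append(f"{key}: missing from YAML")
        elif key not in code_defaults:
            problems.append(f"{key}: missing from code defaults")
        elif yaml_values[key] != code_defaults[key]:
            problems.append(
                f"{key}: YAML={yaml_values[key]!r} != default={code_defaults[key]!r}"
            )
    if problems:
        # Fail loudly instead of silently preferring one source over the other.
        raise ConfigMismatchError("; ".join(problems))


# Passes silently: both sources agree.
assert_matches_defaults({"r": 8, "alpha": 16}, {"r": 8, "alpha": 16})
```

Collecting every disagreement before raising means a reviewer sees all mismatches in one error rather than fixing them one at a time.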
1.2 Frozen-probe specifics
Frozen backbone + trainable linear head over the [CLS] token. No rung-specific hyperparameters beyond the shared training recipe.
| Knob | Setting | Source / Rationale | ADR |
|---|---|---|---|
| Classifier head | linear (hidden_size → 2) | sklearn-equivalent; locked architecture | ADR-017 |
| Backbone gradient | frozen | Defines the “frozen-probe” rung | ADR-017 |
1.3 LoRA specifics
LoRA adapters on the 4 ModernBERT attention/MLP linear modules per encoder layer × 22 encoder layers = 88 adapter modules. Trainable parameter ratio approximately 0.5 % to 1 % of total per ADR-019.
| Knob | Setting | Source / Rationale | ADR |
|---|---|---|---|
| `r` (LoRA rank) | 8 | Literature default for BERT-class models; not searched | ADR-019 |
| `alpha` | 16 | Standard 2 × r convention | ADR-019 |
| `dropout` | 0.1 | LoRA paper default | ADR-019 |
| `target_modules` | `[Wqkv, attn.Wo, mlp.Wo, mlp.Wi]` | ModernBERT attention/MLP linear modules | ADR-019 |
| `modules_to_save` | `[classifier]` | Head trained alongside adapters | ADR-019 |
| `task_type` | `SEQ_CLS` | PEFT sequence-classification task spec | ADR-019 |
| `bias` | `none` | LoRA paper default | ADR-019 |
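
The adapter count and the `alpha`/`r` relationship stated above can be checked with a few lines of arithmetic. This is a sketch only: the `layers.<i>.<module>` naming is a hypothetical scheme for illustration, and `lora_params_per_module` uses the standard LoRA factorisation (an `r × d_in` matrix plus a `d_out × r` matrix) with whatever dimensions the caller supplies.

```python
NUM_ENCODER_LAYERS = 22
TARGET_MODULES = ["Wqkv", "attn.Wo", "mlp.Wo", "mlp.Wi"]
LORA_R = 8
LORA_ALPHA = 16


def adapter_sites(num_layers: int, target_modules: list[str]) -> list[str]:
    """List one adapter site per (layer, target module) pair.

    The "layers.<i>.<module>" naming is illustrative, not the model's real names.
    """
    return [
        f"layers.{layer}.{module}"
        for layer in range(num_layers)
        for module in target_modules
    ]


def lora_scaling(alpha: int, r: int) -> float:
    """Standard LoRA scaling factor applied to the low-rank update."""
    return alpha / r


def lora_params_per_module(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by one adapter: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


sites = adapter_sites(NUM_ENCODER_LAYERS, TARGET_MODULES)
print(len(sites))                          # 4 modules x 22 layers
print(lora_scaling(LORA_ALPHA, LORA_R))    # the "2 x r" convention gives a scaling of 2
```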
1.4 Full-FT specifics
All ModernBERT parameters trainable. No rung-specific hyperparameters beyond the shared training recipe + one storage-discipline relaxation.
| Knob | Setting | Source / Rationale | ADR |
|---|---|---|---|
| `cleanup_intermediate_checkpoints` | `false` (relaxed at X11 2026-05-17) | Phase 5 X11 re-fire needs persisted checkpoints for OOD inference; relaxation is storage-not-methodology | not a methodology lock |
1.5 Classical floor (tfidf-lr)
| Knob | Setting | Source / Rationale | ADR |
|---|---|---|---|
| Word n-gram | 1..2 | Standard bigram TF-IDF | ADR-017 |
| Word max features | 15000 | Bounded vocab | ADR-017 |
| Char n-gram | 3..5 | Subword robustness against tokenisation artifacts | ADR-017 |
| Char max features | 15000 | Bounded vocab | ADR-017 |
| `sublinear_tf` | `true` | Log-scaled term frequency | ADR-017 |
| LR solver | liblinear | sklearn small-dataset default | ADR-017 |
| LR `C` | 1.0 | sklearn default; not searched | ADR-017 |
| LR `class_weight` | balanced | Cross-rung uniform per ADR-019 | ADR-017 + ADR-019 |
| LR `max_iter` | 1000 | Fit-to-convergence; no epoch concept | ADR-017 |
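
For orientation, the settings in the table map onto scikit-learn roughly as below. This is a sketch under stated assumptions, not the project's pipeline: the function name `build_tfidf_lr` is hypothetical, and the table does not say whether the word and char features are concatenated with a `FeatureUnion`, whether the char analyzer is `"char"` or `"char_wb"`, or whether `sublinear_tf` applies to both vectorisers. All three are assumed here. The toy texts exist only to show that the pipeline fits.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline


def build_tfidf_lr() -> Pipeline:
    """Word + char TF-IDF features feeding a logistic-regression classifier."""
    word = TfidfVectorizer(
        analyzer="word", ngram_range=(1, 2), max_features=15000, sublinear_tf=True
    )
    # Assumption: plain "char" analyzer; "char_wb" would respect word boundaries.
    char = TfidfVectorizer(
        analyzer="char", ngram_range=(3, 5), max_features=15000, sublinear_tf=True
    )
    # Assumption: the two feature blocks are concatenated side by side.
    features = FeatureUnion([("word", word), ("char", char)])
    clf = LogisticRegression(
        solver="liblinear", C=1.0, class_weight="balanced", max_iter=1000
    )
    return Pipeline([("features", features), ("clf", clf)])


# Toy data for illustration only; not drawn from the project corpus.
texts = [
    "ignore all previous instructions",
    "what is the weather today",
    "disregard the system prompt",
    "please summarise this article",
]
labels = [1, 0, 1, 0]
model = build_tfidf_lr().fit(texts, labels)
print(model.predict(["ignore the system prompt"]))
```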
2. Exploration trajectory — what was actually swept
Hyperparameter sweep ran during this submission: none.
Per the locked-process rule, hyperparameters were finalised at ADR-019 (Phase 0-04 close, 2026-05-15) before any training began. The values above were inherited from the ModernBERT classification literature + sklearn defaults; no val-set gridsearch executed.
What was swept (operating-point sweeps and stability checks, not hyperparameter tuning):
- Recall@FPR pinpoints at FPR ∈ {0.1 %, 1 %, 5 %} per ADR-046. Reported per-cell in `evals/metrics/per_cell.parquet`. This is a reporting sweep, not a tuning sweep: same trained model, multiple thresholds.
- Dual-policy threshold characterisation (Detection FPR ≤ 1 %; Verification recall ≥ 99 %) per ADR-025. This is also a reporting sweep over val-fit thresholds, not a training sweep.
- Bootstrap seed stability check at `seed=2` (Phase 7 commit `26776dc`): 0 of 40 cells flagged at the 5 % CI-halfwidth threshold. Captured at `evals/bootstrap/paired_cells_seed2.parquet`. This is a bootstrap-stability check, not a model retraining sweep.
- Per-fold variance characterisation: 4 LODO folds × 3 training seeds (42, 43, 44) = 12 cells per (rung, slice, metric) per ADR-019. Variance is reported, not tuned against.
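
To make the "same trained model, multiple thresholds" point concrete, a Recall@FPR operating point can be sketched as below. The function names are hypothetical and the tie-handling rule (flag only scores strictly above the threshold) is an assumption, not the project's documented convention. In the val-fit setup described above, the threshold would be fitted on validation negatives and then applied to held-out positives.

```python
import math
from typing import Sequence


def threshold_at_fpr(neg_scores: Sequence[float], max_fpr: float) -> float:
    """Pick a threshold so that at most max_fpr of negatives score strictly above it."""
    if not neg_scores:
        raise ValueError("need at least one negative score")
    ranked = sorted(neg_scores, reverse=True)
    # Number of negatives allowed above the threshold; the epsilon guards
    # against float products such as 0.9999999 truncating one too low.
    allowed = math.floor(max_fpr * len(ranked) + 1e-9)
    if allowed >= len(ranked):
        return float("-inf")
    return ranked[allowed]


def recall_at_threshold(pos_scores: Sequence[float], threshold: float) -> float:
    """Fraction of positives flagged (score strictly above the threshold)."""
    return sum(s > threshold for s in pos_scores) / len(pos_scores)


# Illustrative scores only: one fixed set of model outputs, three operating points.
neg = [i / 100 for i in range(100)]
pos = [0.5, 0.95, 0.985, 0.999]
for fpr in (0.001, 0.01, 0.05):
    thr = threshold_at_fpr(neg, fpr)
    print(fpr, thr, recall_at_threshold(pos, thr))
```

Nothing in the loop retrains or refits the model; only the decision threshold moves, which is why this counts as a reporting sweep.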
3. Axes held constant (intentional)
What was deliberately not searched, with rationale:
- Backbone choice: single-backbone lock at `answerdotai/ModernBERT-base` per ADR-015. Rationale: the rung-vs-rung comparison isolates capability level (frozen-probe vs LoRA vs full-FT) over a fixed backbone, not backbone choice itself. Multi-backbone comparison is out of scope per ADR-015 + WRITEUP §8.
- Learning rate: `1.0e-4` literature default for BERT-class classification, held constant cross-rung. Rationale: per-rung LR tuning would change the rung-vs-rung interpretation from “what does this capacity layer add” to “what does this capacity layer + its optimal LR add”; the latter is a different methodological contract.
- LR schedule + warmup ratio: cosine + 0.10 warmup; literature default. Same rationale as LR.
- Epoch count: 2 epochs. Bounded by ADR-020 cost envelope; per ADR-019 the choice is “minimum-budget epoch count that achieves stable per-epoch parquet emission”. Higher-epoch sensitivity deferred to NEXT_STEPS §1.3.
- Effective batch size: 32. Bounded by single A100 80GB capacity for the longest tokenisation window (8192 tokens) under bf16.
- LoRA `r` and `alpha`: `r=8`, `alpha=16` literature defaults; not searched. Rationale: `r` search would multiply LoRA-rung training cost by 3-5× (small / medium / large rank) without a methodology contract reason; the rung label is “LoRA at literature-default rank”, not “LoRA at its capacity-optimal rank”.
- LoRA `target_modules`: locked at `[Wqkv, attn.Wo, mlp.Wo, mlp.Wi]` per the ModernBERT-base attention + MLP linear-module set. Rationale: alternate target sets (`Wqkv` only; `mlp.*` only; etc.) would tilt the rung-vs-rung gap by changing what’s adapted; literature default is “all linear modules in attention and MLP”.
- Tokenizer max_length: 8192 (ModernBERT native context). Truncating to a shorter window would distort row-level signal for the long-context prompt-injection corpus per ADR-014 Q4.
- TF-IDF `C`: sklearn default 1.0 for the classical floor. Rationale: classical floor is a diagnostic anchor (per ADR-017 dual role), not a tuned-baseline; tuning would shift its interpretation.
- Optimizer: AdamW (HF Trainer default) cross-rung. Not searched.
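
The "cosine + 0.10 warmup" schedule held constant above has the following general shape. This is a sketch assuming linear warmup from zero followed by a single half-cosine decay to zero, which is the common form of this schedule. The function name `lr_at_step` and the rounding of the warmup step count are assumptions, and the project's trainer may differ in detail.

```python
import math


def lr_at_step(
    step: int,
    total_steps: int,
    base_lr: float = 1.0e-4,
    warmup_ratio: float = 0.10,
) -> float:
    """Learning rate at a given optimiser step under linear warmup + cosine decay."""
    # Assumption: warmup length is the truncated product of ratio and total steps.
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr over the warmup window.
        return base_lr * step / max(1, warmup_steps)
    # Half-cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Illustrative run length; the real step count depends on dataset size and batch size.
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step, total_steps=1000))
```

Because every rung shares the same `base_lr`, schedule, and warmup ratio, differences between rungs cannot be attributed to per-rung schedule tuning.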
4. Caveats
- Budget dependence: the 3-seed × 4-fold = 12-cell-per-rung sample size is the minimum justifiable for the Bayle 2020 `cv_clt_ci` headline CI machinery per ADR-024. Higher seed counts (e.g., 5-seed or 10-seed) would narrow CIs and could move some LoRA-vs-full-FT gap claims from “CI clears zero” to “CI does not clear zero” or vice versa. The bootstrap-seed stability check at `seed=2` (§2) tests this within the bootstrap layer; it does not test cross-seed variance at the training layer.
- Hyperparameter inheritance vs project-locked: the recipe is inherited from ModernBERT literature + sklearn defaults, not tuned for this injection-detection task. A future iteration could justify a small per-rung LR sweep at a higher compute budget if the case-study contract is replaced by a deployment-orientated contract.
- Storage-discipline relaxation at X11: full-FT `cleanup_intermediate_checkpoints` flipped from `true` to `false` mid-Phase-5 to enable OOD inference from persisted weights. This is not a methodology lock per the inline comment in `configs/rungs/full_ft.yaml`: the cleanup was a storage convenience, not an audit-relevant artifact. Relaxing it does not change any prediction parquet or metric value.
- No val-set gridsearch: this submission deliberately cannot point to a tuning sweep that justifies a specific hyperparameter value as “optimal for this task”. The audit story is “locked from literature before training; no tuning happened”, which is the anti-cherry-pick position by design.
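
The “CI clears zero” language in the budget-dependence caveat can be illustrated with a plain normal-approximation interval over per-cell paired differences. This is explicitly not the Bayle 2020 `cv_clt_ci` machinery referenced above: it treats the cells as independent, which cross-validation folds are not, and the function names and sample values are hypothetical. It shows only what the phrase means and why the half-width shrinks as the cell count grows.

```python
import math
from statistics import NormalDist, mean, stdev
from typing import Sequence, Tuple


def normal_ci(diffs: Sequence[float], level: float = 0.95) -> Tuple[float, float]:
    """Normal-approximation CI for the mean of paired per-cell differences."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    # Half-width scales as 1 / sqrt(n): more cells give a narrower interval.
    half_width = z * stdev(diffs) / math.sqrt(len(diffs))
    centre = mean(diffs)
    return centre - half_width, centre + half_width


def clears_zero(ci: Tuple[float, float]) -> bool:
    """True when the whole interval sits on one side of zero."""
    lower, upper = ci
    return lower > 0 or upper < 0


# Illustrative values only: 12 paired differences (3 seeds x 4 folds).
diffs = [0.02, 0.03, 0.02, 0.03, 0.02, 0.03, 0.02, 0.03, 0.02, 0.03, 0.02, 0.03]
ci = normal_ci(diffs)
print(ci, clears_zero(ci))
```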
Audit hook
Every claim in WRITEUP.md that depends on a hyperparameter choice cross-references this file. The pairing makes the anti-cherry-pick story auditable: a reviewer can match each result row to its disclosed setting + verify against configs/rungs/<rung>.yaml at commit v1.0.0.
Linked ADRs: ADR-013 (eval downstream of train), ADR-014 (tokenizer locks), ADR-015 (single-backbone lock), ADR-017 (rung-slate + classical-floor), ADR-019 (transformer training recipe), ADR-020 (cost envelope + batch table), ADR-024 (cross-fold CI machinery), ADR-026 (config-hash discipline), ADR-044 (seed-slate partial supersession of ADR-019).