DeBERTa-v3-base medium ablation methodology — chunk-and-average vs head-truncation × 5-slice OOD; infrastructure-only at v1.1.0; execution deferred to v1.1.1
Superseded on one or more axes by ADR-081. The body below retains its original prose per the ADR-073 immutability rule; the corrected position lives in the superseding ADR. See the Decisions index to navigate.
ADR-060 — DeBERTa-v3-base long-context ablation methodology (v1.1.0 infrastructure; v1.1.1 execution)
Status
Accepted (2026-05-19; methodology lock — infrastructure scaffolds land at v1.1.0; GPU execution deferred to v1.1.1 per Path B of the /exploring-options scope-mismatch resolution).
Context
NEXT_STEPS.md §1.10 was tagged v1.1.x at the v1.0.0 submission close — the only scope item explicitly identified as v1.1.x rather than v1.0.x in the §1 ladder. The motivation, per the §1.10 entry:
“DeBERTa-v3-base is the only widely-cited classical-transformer baseline left untested. ModernBERT’s 8192-token native attention window dominates DeBERTa-v3’s 512 cap on the IID slate, but the OOD slate skews toward shorter prompts in 3 of 5 slices — a tight medium-ablation could clarify whether the ModernBERT win is backbone-dominant or context-window-dominant. Two truncation strategies (chunk-and-average + head-truncation) are confound controls; neither claimed canonical.”
The /exploring-options 2026-05-19 6-batch sub-session 9 Q1 locked the medium ablation scope (Option 2 — single fold/seed training × 2 truncation strategies × full 5-slice OOD eval). ADR-060 codifies that lock.
Mid-execution scope-mismatch discovery: at v1.1.0 Commit 3 prep, the existing training pipeline turned out to be ModernBERT-specific. src/training/load_modernbert.py hardcodes MODERNBERT_BASE_HF_ID = "answerdotai/ModernBERT-base" and is called directly from train_modernbert.py. Generic backbone support never landed because the project trained on a single backbone — generic-loader infrastructure was YAGNI until DeBERTa.
The v1.1.0 plan estimated B5 (DeBERTa configs + training fires + results) at ~2.5h on the assumption that adding a new backbone was a config-only change. The realistic estimate is ~4-6h infrastructure + ~$5-7 GPU. The user (2026-05-19 AskUserQuestion) chose Path B: land the methodology + infrastructure scaffold at v1.1.0; defer execution to v1.1.1.
Decision
Methodology lock (binding at v1.1.0; survives the v1.1.1 execution carryforward)
The DeBERTa-v3-base medium ablation locked scope:
- Backbone: microsoft/deberta-v3-base; pin SHA at v1.1.1 execution time (loaded via huggingface_hub.snapshot_download for reproducibility).
- Training scope: 1 fold (fold 0), 1 seed (seed 42), 2 epochs (matches the ModernBERT recipes per ADR-019). NOT a full LODO grid; ablation appendix only.
- Truncation strategies (2 reported side-by-side as confound controls; neither claimed canonical):
  - Chunk-and-average: tokenize the full text; emit 512-token windows with stride 256 (50% overlap); per-window forward pass; average per-window probabilities to obtain final predicted_proba_class1. Captures long-context signal at the cost of window-edge artifacts.
  - Head-truncation: tokenize the full text; take the first 512 tokens (matches DeBERTa-v3 native window); standard forward pass; emit predicted_proba_class1 directly. Strict DeBERTa-v3 native behavior; loses information past token 512.
- Eval slate: full 5-slice OOD (BIPIA + InjecAgent + JBB-Behaviors + XSTest + NotInject); matches the v1.0.0 headline OOD slate per ADR-021.
- Framing: ablation appendix in RESULTS.md §1B; NOT integrated as a 6th rung in the headline ladder (per §1.10 literal language).
- Compute envelope: 1×L4 or 1×A100; ~30 min wall per training fire; ~$5-7 GPU total for the sequential-single-pod 2-fire shape (per Q2 lock; budgeted under ADR-020 $125 per-job soft cap; current cumulative spend $8.58 → ~$13.58 after v1.1.1).
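The two truncation strategies in the lock can be sketched as follows. This is a minimal illustration, not the planned src/inference/windowed.py: the function names, the score_fn callback, and the toy scorer are hypothetical, and the sketch works on raw token ids without accounting for the special tokens a real tokenizer adds inside the 512-token budget. It also assumes an unweighted mean over windows, so a short final window counts as much as a full one.

```python
from typing import Callable, List, Sequence, Tuple

WINDOW = 512  # DeBERTa-v3 window size, from the methodology lock
STRIDE = 256  # 50% overlap, from the methodology lock

# Hypothetical scorer type: maps one window of token ids to P(class 1).
ScoreFn = Callable[[Sequence[int]], float]


def window_spans(n_tokens: int, window: int = WINDOW, stride: int = STRIDE) -> List[Tuple[int, int]]:
    """Return half-open [start, end) spans that cover n_tokens with overlap."""
    if n_tokens <= window:
        return [(0, n_tokens)]
    spans = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end == n_tokens:  # the last window reached the end of the text
            break
        start += stride
    return spans


def chunk_and_average(token_ids: Sequence[int], score_fn: ScoreFn) -> float:
    """Score every window, then take the unweighted mean of the probabilities."""
    probs = [score_fn(token_ids[s:e]) for s, e in window_spans(len(token_ids))]
    return sum(probs) / len(probs)


def head_truncation(token_ids: Sequence[int], score_fn: ScoreFn) -> float:
    """Score only the first WINDOW tokens; everything after is discarded."""
    return score_fn(token_ids[:WINDOW])


def toy_score(window: Sequence[int]) -> float:
    """Stand-in for a model forward pass: fraction of tokens equal to 1."""
    return sum(1 for t in window if t == 1) / max(len(window), 1)


# Signal placed only in the tail: head-truncation cannot see it.
tokens = [0] * 600 + [1] * 400
head_score = head_truncation(tokens, toy_score)
chunk_score = chunk_and_average(tokens, toy_score)
```

With the signal confined to tokens 600 onward, head_score stays at zero while chunk_score is positive, which is the information-loss contrast the two strategies are meant to expose.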
Infrastructure (lands at v1.1.0)
- configs/rungs/deberta_v3_base.yaml: full hyperparameter recipe. Mirrors the ModernBERT rung yamls; adds a truncation_strategy: field (one of chunk_and_average or head_truncation).
- configs/runpod/headline-deberta.yaml: RunPod orchestration config. lifecycle.on_success: recycle per Q2 lock (#90 consumption); budget.ssh_ready_timeout_sec: 600 per #88 consumption; image pin matches ModernBERT configs; budget.cost_cap_usd: 25.0 (well under ADR-020 $125 soft cap; generous for 2 training fires).
- Makefile targets train-deberta-v3, eval-deberta-v3, deberta-ablation exist as stubs that exit 2 with a “v1.1.1 execution carryforward; see ADR-060” message + pointer to this ADR.
- RESULTS.md §1B placeholder section with the methodology lock + planned scope + “Results pending v1.1.1 GPU fire”.
- WRITEUP/limitations-and-future-work.md §9.2 documents the deferred execution.
- NEXT_STEPS.md §1.10 status → “methodology landed at v1.1.0 (ADR-060); execution v1.1.1”.
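The two config constraints named above (the allowed truncation_strategy values and the cost cap sitting under the ADR-020 soft cap) could be checked along these lines. This is a sketch under assumptions: the function names are hypothetical, the configs are assumed to be already parsed into plain dicts, and the nesting of budget.cost_cap_usd is inferred from the dotted key rather than from the actual yaml files.

```python
from typing import Any, Dict

ALLOWED_STRATEGIES = {"chunk_and_average", "head_truncation"}  # from this ADR
ADR_020_SOFT_CAP_USD = 125.0  # per-job soft cap cited from ADR-020


def validate_rung(rung: Dict[str, Any]) -> str:
    """Return the truncation strategy, or raise if it is missing or unknown."""
    strategy = rung.get("truncation_strategy")
    if strategy not in ALLOWED_STRATEGIES:
        raise ValueError(
            f"truncation_strategy must be one of {sorted(ALLOWED_STRATEGIES)}, got {strategy!r}"
        )
    return strategy


def validate_budget(runpod: Dict[str, Any]) -> float:
    """Return the cost cap, or raise if it exceeds the ADR-020 soft cap."""
    cap = float(runpod["budget"]["cost_cap_usd"])
    if cap > ADR_020_SOFT_CAP_USD:
        raise ValueError(f"cost_cap_usd {cap} exceeds soft cap {ADR_020_SOFT_CAP_USD}")
    return cap


# Dicts standing in for the parsed yaml files; only fields named in this ADR.
rung_cfg = {"truncation_strategy": "chunk_and_average"}
runpod_cfg = {"budget": {"cost_cap_usd": 25.0, "ssh_ready_timeout_sec": 600}}

strategy = validate_rung(rung_cfg)
cap_usd = validate_budget(runpod_cfg)
```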
v1.1.1 landing condition (when execution lands)
- Loader refactor: src/training/load_modernbert.py → src/training/load_backbone.py (generic; accepts hf_id + revision parameters) OR add parallel src/training/load_deberta.py. Library-first invariant + no-orphaned-code invariant apply.
- Trainer dispatch: train_modernbert.py extends to route per backbone (rename + refactor if generic-loader path chosen).
- Chunk-and-average inference loop: new module src/inference/windowed.py implementing the 512-token-window stride-256 forward-pass + score-averaging logic.
- Eval pipeline integration: scripts/run_inference_battery.py + src/eval/slice_analysis.py pick up DeBERTa rung outputs.
- 2 RunPod fires via make deberta-ablation (sequential single-pod 2-fire per Q2 lock); cost preview via make headline-dry-run BEFORE billing; validate --check-image-registry (#97 consumption) is default.
- Results land in evals/predictions/deberta_v3_base_{chunk_avg,head_trunc}__fold0__seed42__*.parquet + aggregated evals/metrics/per_cell_deberta.parquet.
- RESULTS.md §1B placeholder replaced with real numbers (chunk-and-average AUPRC/AUROC × 5 slices + head-truncation AUPRC/AUROC × 5 slices, side-by-side comparison table).
- WRITEUP/limitations-and-future-work.md §9.2 updated with the ablation result; long-context-vs-backbone-dominance verdict.
- evals/cost_ledger.csv records the 2 fire costs (within ADR-020 envelope).
- v1.1.1 tag + GH release + reviewer URL verification.
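One possible shape for the generic-loader path (hf_id + revision parameters, SHA-pinned download) is sketched below. It is an illustration under assumptions, not the decided design: BackboneSpec, load_backbone, and the num_labels=2 default are hypothetical, and the 40-hex check reflects the assumption that "pin SHA" means a full commit hash rather than a branch or tag. The Hugging Face imports are deferred into the function so the spec can be validated without those packages or network access.

```python
import re
from dataclasses import dataclass

_COMMIT_SHA = re.compile(r"[0-9a-f]{40}")  # full git commit hash


@dataclass(frozen=True)
class BackboneSpec:
    """Hypothetical container for the two loader parameters named in this ADR."""

    hf_id: str
    revision: str

    def __post_init__(self) -> None:
        # Reject floating refs such as "main" so a run cannot drift silently.
        if not _COMMIT_SHA.fullmatch(self.revision):
            raise ValueError(f"revision must be a 40-char commit SHA, got {self.revision!r}")


def load_backbone(spec: BackboneSpec, num_labels: int = 2):
    """Download the pinned snapshot and load tokenizer + classifier from it.

    num_labels=2 is an assumption based on the binary predicted_proba_class1 output.
    """
    from huggingface_hub import snapshot_download
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    local_dir = snapshot_download(repo_id=spec.hf_id, revision=spec.revision)
    tokenizer = AutoTokenizer.from_pretrained(local_dir)
    model = AutoModelForSequenceClassification.from_pretrained(local_dir, num_labels=num_labels)
    return tokenizer, model


# Placeholder SHA for illustration only; the real pin is chosen at v1.1.1.
spec = BackboneSpec(hf_id="microsoft/deberta-v3-base", revision="0" * 40)
```

Keeping the backbone identity in a small value object would let the same loader serve both ModernBERT and DeBERTa, which is the point of the generic-loader option; the parallel load_deberta.py option would not need it.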
Consequences
- Reviewer experience: the methodology lock is reviewable immediately at v1.1.0 via this ADR + the placeholder RESULTS §1B section. A reviewer auditing the §1.10 carryforward sees the lock; the deferred execution is honest accounting.
- No methodology drift: future v1.1.1 (or later) execution must hew to the locked scope. Any deviation requires a new ADR (per project anti-pattern: “Mutating a locked decision without writing a superseding ADR”).
- Cost certainty: methodology lock has zero GPU cost; v1.1.0 is a $0 release. Execution cost is bounded by ADR-020 envelope.
- SDD discipline preserved: the spec (this ADR) lands BEFORE the implementation (v1.1.1). The implementation is bound to the spec, not vice versa. Common SDD anti-pattern (write implementation; back-fill ADR) is avoided.
- Library-first preserved: the v1.1.1 infrastructure work (loader refactor + windowed-inference module) follows the library-first invariant. If a primitive belongs upstream (eval_toolkit or similar), file an upstream issue before any local hand-roll.
Linked ADRs
- Referenced: ADR-015 (backbone selection: the original ModernBERT-vs-DeBERTa decision); ADR-020 (RunPod cost discipline); ADR-050 (rung slate narrowing: full-FT OOD dropped; DeBERTa carryforward to v1.1.x); ADR-051 (v1.0.x carryforward governance: same Phase-0-to-v1.x carryforward pattern); ADR-059 (runpod-deploy v0.8.4 bump that enables lifecycle.on_success: recycle consumption per Q2 lock).
- Source: /exploring-options 2026-05-19 batch 9 Q1 (methodology lock); /exploring-options 2026-05-19 Q2 lock (sequential single-pod execution shape); /exploring-options 2026-05-19 AskUserQuestion Path B lock (infrastructure-only v1.1.0 + execution deferred to v1.1.1).
Transcript
Decisions surfaced during the 2026-05-19 v1.1.0 planning session (/exploring-options 6-question execution-lock + Path B scope-mismatch resolution; transcript at transcripts/2026-05-19__v1-1-0-runpod-deploy-and-deberta.md).