DeBERTa-v3-base medium ablation execution (v1.1.2 carryforward; narrow renaming-only supersession of ADR-060 “v1.1.1” body references) — null result, backbone-dominant interpretation
ADR-063 — DeBERTa-v3-base ablation v1.1.2 execution + slot-shift
Status
Accepted (2026-05-19; narrow renaming-only carryforward of ADR-060 v1.1.1 -> v1.1.2 + capture of the executed ablation result).
Context
ADR-060 landed the DeBERTa-v3-base medium ablation methodology lock at v1.1.0 with body text “execution deferred to v1.1.1”. The execution did not happen at v1.1.1 because that release-tag slot was consumed by ADR-061 (a user-feedback-driven Quarto site navigation restructure that took priority on calendar-day 2026-05-19).
The v1.1.2 handoff doc [PLAN_REF redacted; per ADR-068 Class B aspirational-upstream path; original at author’s local ~/.claude/plans/ which is gitignored] sequenced the execution as Phase A (generic loader refactor) -> Phase B (windowed-inference module) -> Phase C (training dispatch + Makefile) -> Phase D (GPU fires) -> Phase E (governance close) -> Phase F (transcript). The v1.1.2 session executed Phases A-D end-to-end on 2026-05-19 and completed part of Phase E (this ADR + NEXT_STEPS + CHANGELOG). The wider RESULTS/WRITEUP/library_imports updates are deferred per the no-conflated-scopes discipline, because the in-flight doc agent working on ADR-062 holds unstaged edits on those same files.
Decision
Narrow renaming-only supersession of ADR-060 body references
ADR-060 is immutable per project ADR-discipline. The body’s “v1.1.1 execution” wording is treated as a placeholder that was assigned without knowing the v1.1.1 slot would be needed for ADR-061. The DeBERTa execution carries forward to v1.1.2; commit messages document the shift. ADR-060’s methodology lock (single fold/seed; 2 truncation strategies; 5-slice OOD eval; ablation-appendix framing; ~$5-7 GPU envelope) is BINDING regardless of release-slot. No methodology content is changed by this carryforward.
v1.1.2 Phase A–D execution outcome
Phase A (commit a34128b): refactored src/training/load_modernbert.py to src/training/load_backbone.py with a generic hf_id kwarg so the same flash-attn-fallback recipe (per ADR-020) serves both ModernBERT (ADR-019) and DeBERTa-v3-base (ADR-060). Per the no-orphans invariant, the old loader was deleted in the same commit; 6 call sites + 3 test files updated.
Phase B (commit 9ecf0e3): added src/inference/windowed.py implementing chunk-and-average + head-truncation truncation strategies via HF tokenizer’s native sliding-window protocol (return_overflowing_tokens=True + stride); 15 mocked-only smoke tests at tests/smoke/test_windowed_inference_smoke.py. Reuses src.training.softmax_cast.softmax_fp32 for ADR-019 numerical stability.
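The two strategies can be sketched as follows. This is a minimal illustration, not the contents of src/inference/windowed.py. It assumes per-chunk logits are already computed, that the chunk-to-document assignment comes from the overflow_to_sample_mapping field that HF fast tokenizers return alongside return_overflowing_tokens=True, and that chunk-and-average means averaging per-chunk softmax probabilities (averaging logits would be another plausible reading). The softmax helper is a plain-Python stand-in for softmax_fp32, and all function names are hypothetical.

```python
import math
from collections import defaultdict


def softmax(logits):
    """Numerically stable softmax over one chunk's logits (stand-in for softmax_fp32)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def chunk_and_average(chunk_logits, sample_mapping):
    """Average per-chunk class probabilities within each source document."""
    grouped = defaultdict(list)
    for logits, sample_idx in zip(chunk_logits, sample_mapping):
        grouped[sample_idx].append(softmax(logits))
    return {
        idx: [sum(col) / len(col) for col in zip(*probs)]
        for idx, probs in grouped.items()
    }


def head_truncation(chunk_logits, sample_mapping):
    """Keep only the first chunk of each document (chunks arrive in document order)."""
    first = {}
    for logits, sample_idx in zip(chunk_logits, sample_mapping):
        first.setdefault(sample_idx, softmax(logits))
    return first


# Hypothetical binary logits: document 0 spans two chunks, document 1 fits in one.
chunk_logits = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
sample_mapping = [0, 0, 1]
averaged = chunk_and_average(chunk_logits, sample_mapping)
truncated = head_truncation(chunk_logits, sample_mapping)
```

In this toy input the two chunks of document 0 disagree, so averaging pulls its score toward the middle while head truncation keeps the first chunk's confident score. For single-chunk documents the two strategies coincide.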
Phase C (commit 2ed8e04): pinned DeBERTa-v3-base revision SHA 8ccc9b6f36199bec6961081d44eb72fb3f7353f3 via huggingface_hub.HfApi.model_info; wired training_strategy dispatch into train_modernbert.py (prepare_model + _write_predictions_parquet + PerEpochPredictionsCallback + train_one_cell); added VALID_RUNG_NAMES + VALID_TRUNCATION_STRATEGIES constants; extended train_rung.py with --truncation-strategy CLI override + DeBERTa rung dispatch; added --epoch-filter to run_metrics_battery.py; wired 3 Makefile targets (train-deberta-v3 / eval-deberta-v3 / deberta-ablation).
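A CLI override validated against a fixed set of strategy names might look like the sketch below. It is an assumption-laden illustration rather than the actual train_rung.py: the constant's values are taken from the strategy names in the results table, and build_parser and resolve_strategy are hypothetical helpers.

```python
import argparse

# Values taken from the strategy names in the results table; the real constant may differ.
VALID_TRUNCATION_STRATEGIES = ("chunk_and_average", "head_truncation")


def build_parser():
    """Build a parser exposing the truncation strategy as an optional override."""
    parser = argparse.ArgumentParser(description="Sketch of a rung-training CLI")
    parser.add_argument(
        "--truncation-strategy",
        choices=VALID_TRUNCATION_STRATEGIES,
        default=None,  # None means "use whatever the YAML config specifies"
        help="Override the truncation strategy from the config file.",
    )
    return parser


def resolve_strategy(cli_value, config_value):
    """CLI value wins when given; otherwise fall back to the validated config value."""
    strategy = cli_value if cli_value is not None else config_value
    if strategy not in VALID_TRUNCATION_STRATEGIES:
        raise ValueError(f"unknown truncation strategy: {strategy!r}")
    return strategy


args = build_parser().parse_args(["--truncation-strategy", "head_truncation"])
chosen = resolve_strategy(args.truncation_strategy, "chunk_and_average")
```

Validating in both places is deliberate: argparse rejects a mistyped flag value early, and resolve_strategy catches a bad value that arrives through the config file instead.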
Phase D (commits 3791c1a through 898fae5; 8 commits total across the fix-cycle + the closing feat commit): fired both strategies on a single warm A100-SXM4-80GB pod (US-MD-1) via lifecycle.on_success: recycle. Pre-flight surfaced 7 distinct infrastructure errors before the training succeeded end-to-end:
- 83fd348: added sentencepiece Python dep (DeBERTa-v3 tokenizer needs the SentencePiece backend).
- 99501ba: narrowed headline-deberta.yaml staging excludes (later subsumed by 33387b5; kept for audit-trail completeness).
- 33387b5 (load-bearing FUSE fix): moved /workspace (FUSE/MooseFS) -> /root (container overlay disk; POSIX semantics work) for project code + HF cache + secrets + run scripts + logs. Re-extends the X8 venv-on-/root workaround (per fuse-workspace-needs-uv-link-mode-copy memory) to all writable paths; without this the rsync push fails with Input/output error (5) on /workspace.
- f660f76: added protobuf Python dep (transformers’ SentencePieceExtractor.__init__ calls requires_backends(self, "protobuf") independently from the sentencepiece package).
- 60fdc53: bound truncation_strategy in checkpoint_dir_template.format() (the Phase C2 dispatch wiring missed this specific format site).
- aa91067 (load-bearing numerical-stability fix): DeBERTa-v3 + bf16 produces loss=0 + grad_norm=NaN from training step 1 (disentangled attention overflows bf16 mantissa). Made build_training_args + load_backbone accept YAML-driven bf16/fp16/learning_rate/num_train_epochs overrides; switched DeBERTa to fp32. Locally validated forward+backward before re-firing (loss=0.7268 + 202/202 finite gradients on fp32 vs loss=NaN + 0/202 finite on bf16).
- 67679a5: fixed checkpoint path doubling (deberta_v3_base.yaml checkpoint_dir_template prefix conflicted with train_rung.py --checkpoint-root default); dropped the FUSE staging-bounce (--checkpoint-staging-root /root/training_staging is unnecessary since we’re already on /root overlay disk).
The 8th commit (898fae5) closed Phase D with both training fires complete, 4 training-time predictions parquets, 10 OOD-inference predictions parquets (via the new scripts/run_deberta_ood_inference.py standalone OOD orchestrator, designed as a narrow companion to scripts/run_inference_battery.py since the DeBERTa checkpoint path has an extra strategy-nesting level), and the 6-row evals/metrics/per_cell_deberta.parquet.
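The 60fdc53 fix in the list above concerns a template placeholder that one call site did not bind. A minimal sketch of that failure class, assuming it surfaced as an unbound str.format placeholder: the template string and the checkpoint_dir helper are hypothetical, since the real checkpoint_dir_template is not shown in this ADR.

```python
# Hypothetical template; the real checkpoint_dir_template is not shown in this ADR.
checkpoint_dir_template = "{rung}/{truncation_strategy}/fold{fold}_seed{seed}"


def checkpoint_dir(template, **fields):
    """Render the template, naming the unbound placeholder if one is missing."""
    try:
        return template.format(**fields)
    except KeyError as exc:
        raise ValueError(f"template placeholder {exc} was not bound") from exc


# Binding every placeholder renders one directory per strategy.
path = checkpoint_dir(
    checkpoint_dir_template,
    rung="deberta_v3_base",
    truncation_strategy="chunk_and_average",
    fold=0,
    seed=42,
)

# Omitting truncation_strategy triggers the unbound-placeholder error.
try:
    checkpoint_dir(checkpoint_dir_template, rung="deberta_v3_base", fold=0, seed=42)
    missing_key_error = None
except ValueError as err:
    missing_key_error = str(err)
```

Routing every format site through one helper like this is one way to keep a newly added placeholder from being missed at a single call site.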
Headline result (publishable null)
| strategy | jbb_behaviors AUPRC | xstest AUPRC | pooled_ood AUPRC |
|---|---|---|---|
| chunk_and_average | 0.4855 | 0.3966 | 0.2912 |
| head_truncation | 0.4890 | 0.3912 | 0.2895 |
Single-class slices (bipia, injecagent, notinject all-positive; iid all-negative for the LODO held-out source) are correctly skipped per ADR-006 single-class-slice handling (AUROC undefined -> pydantic validation nan-guard).
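The skip rule can be illustrated with a small guard. This sketch only shows the single-class check; it does not reproduce the ADR-006 pydantic validation, the helper names are hypothetical, and the labels are toy values chosen to mirror the class composition described above.

```python
def is_single_class(labels):
    """True when a slice contains only positives or only negatives."""
    return len(set(labels)) < 2


def scoreable_slices(slices):
    """Split slices into those that can be scored and those that must be skipped."""
    scored, skipped = [], []
    for name, labels in slices.items():
        (skipped if is_single_class(labels) else scored).append(name)
    return scored, skipped


# Toy labels (1 = positive); only the class composition mirrors the text above.
slices = {
    "jbb_behaviors": [1, 0, 1, 0],
    "xstest": [0, 1, 0, 0],
    "bipia": [1, 1, 1],
    "injecagent": [1, 1],
    "notinject": [1, 1, 1],
    "iid": [0, 0, 0],
}
scored, skipped = scoreable_slices(slices)
```

With these toy labels only jbb_behaviors and xstest remain scoreable on their own, which matches the two per-slice columns in the headline table.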
The 2 truncation strategies produce essentially identical per-slice metrics: the AUPRC gap is at most 0.0054 on any scored slice, and its sign is not consistent across slices. By the ADR-060 confound-control interpretation, the ModernBERT advantage on the headline ladder is BACKBONE-DOMINANT, not context-window-dominant: long-context (chunk-and-average) provides no measurable benefit over head-truncation on this 5-slice OOD slate. The interpretive caveat goes in the RESULTS §1B + WRITEUP/limitations-and-future-work.md §9.2 follow-up patch.
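The size of the gap between strategies can be checked directly from the headline table. The snippet below is an illustrative calculation on the published values only; it is not part of the metrics battery.

```python
# AUPRC values copied from the headline table.
auprc = {
    "chunk_and_average": {"jbb_behaviors": 0.4855, "xstest": 0.3966, "pooled_ood": 0.2912},
    "head_truncation": {"jbb_behaviors": 0.4890, "xstest": 0.3912, "pooled_ood": 0.2895},
}

# Signed delta per slice: positive means chunk-and-average scored higher.
deltas = {
    slice_name: round(
        auprc["chunk_and_average"][slice_name] - auprc["head_truncation"][slice_name], 4
    )
    for slice_name in auprc["chunk_and_average"]
}
max_abs_delta = max(abs(d) for d in deltas.values())
```

Head truncation is ahead on jbb_behaviors while chunk-and-average is ahead on xstest and pooled_ood. Because the ablation uses a single fold and seed, there is no variance estimate to compare these deltas against.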
Cost reconciliation
Actual GPU spend across the 9 Phase D pod manifests (7 short failures at ~$0.05-0.20 each + 2 successful ~6-min training fires at ~$0.18 each): $1.34, well under the ADR-060 $5-7 expected envelope. Cumulative project spend: $9.92 against the ADR-020 $200 hard cap.
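As a consistency check on these figures, the implied mean cost of a failed fire can be backed out from the total. The per-fire costs are approximate in the text, so the result is indicative only.

```python
total_spend = 1.34          # reported Phase D GPU spend, USD
successful_fires = 2
cost_per_success = 0.18     # approximate, per the text
failed_fires = 7

# Whatever the successful fires did not consume is attributed to the failures.
failure_spend = total_spend - successful_fires * cost_per_success
mean_failure_cost = failure_spend / failed_fires

cumulative_spend = 9.92     # project-wide spend, USD
hard_cap = 200.0            # ADR-020 hard cap, USD
cap_fraction_used = cumulative_spend / hard_cap
```

The implied mean of roughly $0.14 per failed fire falls inside the stated $0.05-0.20 range, and cumulative spend comes to just under 5% of the hard cap.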
Consequences
ADR-060 immutability preserved: this narrow renaming-only supersession captures the release-slot shift without rewriting ADR-060’s body text. Future readers cross-reference ADR-060 -> ADR-063 to understand why “v1.1.1 execution” actually landed at v1.1.2.
Methodology lock honored: the executed ablation matches every ADR-060 constraint (microsoft/deberta-v3-base; fold 0; seed 42; 2 epochs; 2 truncation strategies; full 5-slice OOD; ablation appendix; not a 6th rung). No methodology drift.
Publishable null result: the per-strategy headline directly resolves the long-context-vs-backbone-dominance question that motivated NEXT_STEPS §1.10 and the v1.1.x carryforward. The ModernBERT-vs-DeBERTa headline-ladder gap (per RESULTS §1) is attributed to backbone architecture, not context window. The reviewer can now read the ablation appendix in RESULTS §1B and conclude: “ModernBERT’s 8192-token window did not help on this slate; the win is backbone-architectural.”
Deferred Phase E artifacts: RESULTS §1B + WRITEUP §9.2 + library_imports.md inventory entries + SUBMISSION_AUDIT regeneration defer to a follow-up patch after the in-flight doc agent’s ADR-062 commit stabilises. The deferral is documented here + in 898fae5 + 61a09c9 commit messages so the audit trail is complete even before the doc artifacts land.
Library-first preserved: src/inference/windowed.py uses HF tokenizer’s native sliding-window protocol; no hand-rolled window-stride arithmetic. Reuses softmax_cast.softmax_fp32 (ADR-019 numerical stability). The chunk-and-average pattern is project-internal infrastructure (specific to the ModernBERT-vs-DeBERTa-v3 confound control), not a generic eval-toolkit primitive; no upstream MR filed.
No methodology drift: the bf16 -> fp32 switch was a numerical-stability fix (DeBERTa-v3 + bf16 produces NaN gradients), not a methodology change. ADR-060 did not constrain the precision; the fp32 choice was made via the YAML’s training.bf16: false flag, with build_training_args honoring the override per a backward-compatible plumbing change that preserves ADR-019 ModernBERT bf16 default.
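The backward-compatible override can be pictured as a defaults-plus-overlay merge. This is a sketch under stated assumptions, not the project's build_training_args: the bf16 and lr=1e-4 defaults are the ADR-019 values cited in this ADR, the remaining default values are placeholders, and resolve_training_args is a hypothetical name.

```python
# ADR-019 defaults as cited in this ADR (bf16 on, lr=1e-4); the other values are placeholders.
DEFAULT_TRAINING_ARGS = {
    "bf16": True,
    "fp16": False,
    "learning_rate": 1e-4,
    "num_train_epochs": 2,
}
OVERRIDABLE_KEYS = frozenset(DEFAULT_TRAINING_ARGS)


def resolve_training_args(yaml_training_section=None):
    """Overlay YAML-provided values on the defaults, rejecting unknown keys."""
    overrides = yaml_training_section or {}
    unknown = set(overrides) - OVERRIDABLE_KEYS
    if unknown:
        raise KeyError(f"unsupported training overrides: {sorted(unknown)}")
    resolved = {**DEFAULT_TRAINING_ARGS, **overrides}
    if resolved["bf16"] and resolved["fp16"]:
        raise ValueError("bf16 and fp16 are mutually exclusive")
    return resolved


modernbert_args = resolve_training_args()               # no overrides: bf16 stays on
deberta_args = resolve_training_args({"bf16": False})   # mirrors training.bf16: false
```

A config that sets nothing resolves to the existing defaults, so the ModernBERT path is unchanged. Setting bf16 to false with fp16 left off yields full fp32 for the DeBERTa run.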
Linked ADRs
- Referenced:
- ADR-060 — methodology lock; this ADR carries forward its execution landing condition to v1.1.2 (narrow renaming-only).
- ADR-061 — consumed the v1.1.1 slot that ADR-060 had body-text-named for DeBERTa execution.
- ADR-006 — single-class slice handling (AUROC nan-guard).
- ADR-019 — bf16 + lr=1e-4 default; this ADR’s build_training_args plumbing change preserves ADR-019 ModernBERT defaults while allowing DeBERTa fp32 overrides via YAML.
- ADR-020 — cost envelope ($25 per-job soft cap honored; $1.34 actual).
- ADR-059 — runpod-deploy v0.8.4 lifecycle.on_success: recycle (the warm-pod 2-fire shape this ADR consumed).
- Source: v1.1.2 execution session per transcripts/2026-05-19__v1-1-2-deberta-execution.md (private; emailed at submission).
Transcript
transcripts/2026-05-19__v1-1-2-deberta-execution.md — captures the full A-D sub-phase walkthrough including the 7-commit fix-cycle diagnosis + the load-bearing FUSE + fp32 fixes.