Phase 2 training implementation bundle — seed slate reconciliation + manifest move + classical-floor location + YAML config schema + trainer split + per-rung orchestration + fixture preflight
ADR-044: Phase 2 training implementation bundle
Status
Accepted (2026-05-16). Body-text partially supersedes ADR-019 for the seed slate only — the 3-tuple (42, 1337, 2025) is replaced by (42, 43, 44) to align with Phase 1 materialization per ADR-041; the rest of ADR-019 (recipe + LoRA config + TrainingArguments + WeightedTrainer + per-epoch save discipline + fp32 softmax cast) is preserved unchanged. ADR-019 frontmatter status remains Accepted (partial-supersession pattern matches the body-text-only convention established by ADR-043’s relationship to ADR-016 Q5).
Context
Phase 2 (Training) was unblocked at Phase 1 close (commit 496c085 + push to origin/main, 2026-05-16). The Phase 2 walkthrough surfaced three pre-lock inconsistencies that needed resolution at Phase 2 entry, plus four genuine implementation choices that the pre-locks left open.
Three pre-lock inconsistencies
Seed slate divergence: ADR-019 line 99 specified
seed=42iterated across(42, 1337, 2025)per ADR-006’s three-seed floor. ADR-041 (Phase 1 implementation bundle) andsrc/data/splits.py::SEEDSmaterialized the 12 splits at(42, 43, 44)— a different arbitrary 3-tuple at the same ADR-006 floor. Phase 2’s trainer must loaddata/processed/fold-N/seed-N/(train|val|test).parquet, so the trainer cannot use ADR-019’s slate without re-materializing the data layer.Source manifest location: Phase 1 implementation placed the manifest at
data/source_manifest.yaml. ADR-026 (concern-grouped 5-subpackage layout, Phase 0-06) explicitly placed the manifest atconfigs/data/source_manifest.yamland reserveddata/as non-committed HF cache + processed parquets. Phase 1’s choice violated ADR-026’s reservation.Classical-floor module location: ADR-017 line 92 referenced
src/rungs/tfidf_lr.pyas a Phase 1 deliverable. ADR-026 locked 5 subpackages(data, training, scoring, eval, utils)— nosrc/rungs/. The classical floor is a trained rung; it belongs insrc/training/.
Four implementation choices left open
ADR-019 + ADR-020 lock the training recipe (hyperparameters + GPU failover + adaptive batch + cost cap) at code-snippet level. ADR-026 locks the module layout. ADR-027 locks the smoke/canonical separation. What remained open at Phase 2 entry:
- YAML config schema — what lives in
configs/rungs/<rung>.yaml? - Trainer architecture split — single
trainer.pywith rung-mode branching, or split by stack? - Orchestration granularity — 48 runs total; single sweep, per-cell, or per-rung?
- Phase 2
make smokeextension — what does ADR-027’s deferred fixture-pipeline wiring look like in practice?
The /exploring-options Phase-2 walkthrough generated 7 numbered questions covering both surfaces; the user ratified the 7 recommendations together (Phase 1 precedent). This ADR locks all seven; subsequent Phase 2 commits implement them.
Decision
Q1 — Seed slate (42, 43, 44) (partial supersession of ADR-019)
ADR-019 line 99 read seed=42 # iterated across ADR-006 slate (42, 1337, 2025). This ADR replaces the parenthetical with (42, 43, 44) to match Phase 1’s materialized splits. Both slates are arbitrary at the ADR-006 three-seed floor — neither has methodology-specific properties. Re-materializing splits would invalidate evals/leakage_report.json, evals/data_audit.json, evals/contamination_scan.json, 36 train/val/test parquets, and 36 index-mask parquets without methodology gain; preserving Phase 1’s slate is the discipline-correct fix-forward.
The rest of ADR-019 is preserved unchanged: LoRA config (r=8, alpha=16, dropout=0.1, explicit target_modules enumeration), TrainingArguments (lr=1e-4, warmup_ratio=0.10, cosine schedule, 2 epochs, bf16, max_grad_norm=1.0, weight_decay=0.01, AdamW), data collator (max_length=8192, pad_to_multiple_of=8, dynamic padding + head-truncation), WeightedTrainer (sklearn-style class_weight balanced per-fold), per-epoch save discipline (epoch-2 headline; epoch-1 diagnostic; full-FT intermediates not persisted), and fp32 cast before final softmax.
Q2 — Source manifest moved to configs/data/source_manifest.yaml
git mv data/source_manifest.yaml configs/data/source_manifest.yaml (this commit). Updates 10 path references across src/data/, scripts/, tests/, Makefile, SPEC_SHEET.md, SPEC_GREENFIELD.md, assumptions.md, .gitignore. The data/ directory is now reserved per ADR-026 for non-committed artifacts (raw HF cache + processed parquets + dedup-holdout JSONL + contamination-templates parquet); .gitignore comment updated accordingly.
Q3 — Classical floor at src/training/tfidf_lr.py
ADR-026 (later than ADR-017) is the authoritative layout lock. The 5-subpackage taxonomy has no src/rungs/. The classical floor is a trained rung, so it belongs in src/training/. ADR-017’s reference to src/rungs/tfidf_lr.py was a speculative path naming pre-ADR-026; this ADR ratifies the canonical location.
Q4 — Per-rung YAML schema (canonical source of truth)
Each configs/rungs/<rung>.yaml (4 files: classical_floor.yaml, frozen_probe.yaml, lora.yaml, full_ft.yaml) carries:
- Header comment:
# Locked per ADR-019 (transformer rungs) or ADR-017 (classical floor). Do not edit without superseding ADR. - Identity:
rung_id,rung_label,classifier_type(one of:classical,frozen_probe,lora,full_ft) - Recipe: all training hyperparameters per ADR-019 / ADR-017 (mirrored declaratively so the YAML is the config-hash source of truth per SPEC §5 + ADR-026 line 33)
- Output paths:
predictions_dir_template(uses{rung},{fold},{seed},{epoch}placeholders);checkpoint_dir_template
The trainer code reads the YAML and instantiates the recipe; runtime assertion that YAML values match ADR-019 / ADR-017 locked defaults (fail-loud on drift). The “canonical YAML” pattern means a config-hash derived from the YAML uniquely identifies the recipe — required for the SPEC §5 invariant.
Q5 — Trainer split by stack
src/training/train_modernbert.py— 3 transformer rungs (frozen-probe + LoRA + full-FT). HF Trainer + WeightedTrainer per ADR-019. Single file with mode dispatch onclassifier_type.src/training/train_classical.py— sklearn TF-IDF + LR per ADR-017. Separate file; sklearn stack is unrelated to HF stack.
Both expose the same callable interface: train(config: RungConfig, fold: int, seed: int) -> Path returning the predictions parquet path. This satisfies ADR-026 line 73’s “multi-rung trainer” language by treating train_modernbert.py as the multi-rung trainer for the transformer slate (the original ADR-026 wording was authored when classical-floor placement was assumed at src/rungs/).
Q6 — Per-rung orchestration
scripts/train_classical_floor.py— runs the 12 classical cells on CPU locally; sklearnLogisticRegressionper ADR-017; near-zero cost; ~5 min wall-clock.scripts/train_rung.py --rung {frozen_probe|lora|full_ft}— sweeps 12 cells (3 seeds × 4 LODO folds) per invocation on a GPU pod via runpod-deploy. Each rung is one pod-job.
Total: 4 orchestrator invocations (1 local CPU + 3 GPU). Failure isolation: if lora fails mid-sweep, only that rung re-runs. Cost-cap maps cleanly to the per-rung budget (frozen-probe + LoRA < full-FT; per-rung RunPod configs sized accordingly).
Q7 — Phase 2 make smoke extension
configs/profiles/fixtures.yaml carries the smoke profile — 4 sources × ~50 examples each × 1 LODO fold × 1 seed × 1 epoch (overrides default 4×3×2 cells). tests/fixtures/ carries the synthetic parquets. make smoke (already partially wired per Phase 1) is extended to:
smoke: test-smoke
uv run python scripts/train_classical_floor.py --config configs/profiles/fixtures.yaml
uv run python scripts/train_rung.py --rung frozen_probe --config configs/profiles/fixtures.yamlTotal wall-clock target: under 5 minutes on a laptop CPU (no GPU, no network). Validates the full Phase 2+ code path end-to-end without GPU before paying for cloud time. Closes ADR-027 line 75 deferred wiring.
Consequences
Positive
- All 7 implementation choices are auditable in a single ADR; subsequent commits cite ADR-044 Q-N for specific decisions.
- Seed reconciliation preserves Phase 1 artifacts (no re-materialization cost).
- Manifest move + classical-floor location honor ADR-026 layout lock (5-subpackage taxonomy intact).
- Per-rung orchestration enables per-rung resumability + per-rung cost caps + per-rung GPU class choices.
- Canonical YAML schema is required for the SPEC §5 config-hash invariant; “YAML as source” pattern satisfies it.
- Fixture preflight closes ADR-027 deferral and provides a 5-minute reviewer-friendly verification path.
Negative / cost
- ADR-019’s
(42, 1337, 2025)seed slate is body-text-superseded; the partial-supersession pattern (no frontmattersuperseded_byfield) follows ADR-043’s precedent and requires the reviewer to read ADR-044 to see the divergence. - 10 path references updated across the repo for the manifest move; small but real change surface.
- Per-rung orchestration adds 4 CLI entrypoints (3 transformer rungs + 1 classical floor) where a single sweep would be 1; modest complexity for the failure-isolation gain.
- Fixture preflight requires
tests/fixtures/*.parquetsynthetic data — generated lazily or committed; either way adds repo churn.
Neutral
- ADR-019 recipe (hyperparameters + WeightedTrainer + per-epoch save) preserved unchanged.
- ADR-020 compute infrastructure (GPU failover ladder + adaptive BATCH_TABLE + flash-attn fallback + dual-layer cost cap) preserved unchanged.
- ADR-026 5-subpackage layout preserved unchanged (in fact, more strictly honored).
- ADR-027 three-target Makefile preserved unchanged; the smoke target gains the fixture-pipeline wiring it always deferred to Phase 1+.
Alternatives Considered
Q1 alternatives
- Re-materialize splits at
(42, 1337, 2025)— rejected; invalidates all Phase 1 evals JSONs + 36 parquets + 36 masks without methodology gain. - Quietly switch trainer seeds — rejected per CLAUDE.md anti-pattern (no mutation of locked decision without superseding ADR).
Q3 alternatives
- Add
src/rungs/via superseding ADR to ADR-026 — rejected; adds a sixth subpackage with one occupant for a non-load-bearing reason; ADR-026’s 5-subpackage taxonomy is sufficient. src/scoring/tfidf_lr.py— rejected;src/scoring/per ADR-026 line 74 is for reference-scorer adapters (inference-only wrappers around external models); TF-IDF + LR is a trained rung in our slate, not a reference scorer.
Q4 alternatives
- Minimal YAML (selectors only, hyperparams in code) — rejected; SPEC §5 config-hash invariant requires YAML to be canonical; minimal YAML means code is canonical and config-hash is meaningless.
- Full snapshot YAML with runtime
assert config == code_defaults— rejected; the assertion is the wrong direction (code defaults are not the lock; YAML is).
Q5 alternatives
- Single
trainer.pywith 4-way dispatch onclassifier_type— rejected; sklearn stack and HF stack share no primitives; one file would be 400+ LoC with two disjoint code paths. - Four separate trainer files (one per rung) — rejected; the three transformer rungs share the HF Trainer + WeightedTrainer + LoraConfig recipe; splitting them duplicates the recipe.
Q6 alternatives
- Single sweep CLI
scripts/train_all_rungs.py— rejected; all-or-nothing 10+ hour wall-clock; failure mid-sweep loses progress. - Per-cell CLI
scripts/train_one.py --rung X --seed Y --fold Z— rejected; runpod-deploy does not natively job-array; 48 separate pod invocations adds significant overhead vs 4 sweeps.
Q7 alternatives
- Smoke tests only (
pytest -m smoke) — rejected; doesn’t validate the trainer code path end-to-end before paying for GPU time. - Fixture-data 1-epoch trial without full pipeline — rejected; partial smoke leaves the eval JSON shape unverified.
References
- ADR-015 — Rung architecture refinement (ModernBERT-base single-backbone trained transformer slate)
- ADR-017 — Trained-rung slate expansion (TF-IDF + LR classical floor + frozen-probe dual role)
- ADR-019 — LoRA + transformer training recipe (this ADR partially supersedes ADR-019 seed slate only)
- ADR-020 — Compute infrastructure + cost discipline (preserved unchanged)
- ADR-026 — Module layout (concern-grouped 5-subpackage taxonomy; this ADR strictly honors)
- ADR-027 — Smoke vs canonical separation (this ADR closes the deferred fixture-pipeline wiring per line 75)
- ADR-041 — Phase 1 data implementation bundle (this ADR aligns Phase 2 seeds with Phase 1 materialization)
- ADR-043 — Post-split cross-source leakage cleanup (precedent for body-text-only partial supersession pattern)
Transcript
See transcripts/2026-05-16__phase-2-implementation.md for the /exploring-options Phase-2 walkthrough that produced this bundle.