Phase 1 library-first carryforward refactor — audit findings + remediation plan (splits + dedup + leakage_report + contamination_scan migration to eval-toolkit primitives; orphaned code removed in-commit)

Published

May 16, 2026

ADR-047: Phase 1 library-first carryforward refactor

Status

Accepted (2026-05-16). Does not supersede ADR-041 (Phase 1 implementation bundle), ADR-042 (dedup-holdout protocol), or ADR-043 (post-split leakage cleanup) — those locked the methodology; this ADR locks the library-first carryforward refactor of the implementation while preserving every methodology constant bit-for-bit.

Context

How the audit was triggered

Phase 4 entry walkthrough generated a 7-question /exploring-options ratify session (per Phase 0 + Phase 2 + Phase 3 precedent). Q6 (figures slate) initially proposed hand-rolled matplotlib for the 7-figure slate; the user pushed back — eval-toolkit ships plotting.py with plot_pr_curve, plot_reliability_diagram, plot_bootstrap_distribution, plot_metric_bars, plot_lift_ci, plot_score_histograms, plot_confusion_matrix_grid, save_figure (PNG/PDF/SVG with provenance iTXt), set_plot_style + PALETTE — and the project’s library-first invariant requires consuming these before any local glue. Four upstream gaps surfaced (plot_roc_curve + plot_pareto_frontier + plot_slice_metric_heatmap + paired_bootstrap_diff n_jobs kwarg) and were filed as issues #14 + #15 + #16 + #17 on brandon-behring/eval-toolkit before any local Phase 4 figures-slate glue ships.

User then reaffirmed the rule as project-wide — “this is true throughout everything in this project so we want to make sure are not handrolling tools that can be better handled by my eval-toolkit and separately tested there with golden eval data sets etc” — and explicitly extended the audit retroactively to Phases 1 + 2 + 3 (already-shipped) plus prospectively to Phase 5 (planned).

Audit method

For each src/ subpackage, the audit (a) inventoried the public API surface (every def + class at module top-level), (b) inventoried the corresponding eval-toolkit module’s public API surface (where one exists with overlapping module-name semantics — src/data/dedup.py ↔︎ eval_toolkit.text_dedup; src/data/splits.py ↔︎ eval_toolkit.splits; src/data/audit.py ↔︎ eval_toolkit.leakage; etc.), (c) read both signatures + docstrings + reference impls + confirmed semantic match, (d) classified each local symbol as hand-roll (upstream primitive fits cleanly), partial overlap (upstream has machinery, project glue composes around it), or project-specific (no upstream equivalent; legitimate local code).

Audit results

Phase	Module	Classification	Refactor target
Phase 1	`src/data/splits.py::make_splits`	hand-roll	`SourceDisjointKFoldSplitter`
Phase 1	`src/data/dedup.py` (3 fns)	hand-roll	`near_dedup` + `cross_dedup` + `EmbeddingCosineStrategy(embedder=compute_embeddings)`
Phase 1	`src/data/audit.py::compute_leakage_report`	hand-roll	`run_leakage_checks([ExactDuplicateCheck(), NearDuplicateCheck(...), CrossSplitLeakageCheck()])`
Phase 1	`src/data/audit.py::compute_contamination_scan`	partial overlap	`EmbeddingCosineStrategy(embedder=compute_embeddings).pairs_across(query, reference, k=1)` + project per-source aggregation glue
Phase 1	`src/data/manifest_validation.py`	project-specific	config-time YAML schema for `source_manifest.yaml` — `eval_toolkit.manifest` is runtime artifact provenance — different concern
Phase 1	`src/data/loaders.py`	project-specific	per-source bespoke normalizers (HackAPrompt + LMSYS-Chat-1M + …) — `HFDatasetsLoader` is generic — project glue legitimate
Phase 1	`src/data/templates.py`	project-specific	HackAPrompt template extraction — no upstream equivalent
Phase 2	`src/training/*.py`	clean	HF Transformers + LoRA + sklearn — `eval-toolkit` doesn’t compete here
Phase 3	`src/scoring/*.py`	clean	LLM-judge wrappers + ProtectAI inference — no upstream overlap
Phase 3	`src/eval/{calibration_battery, operating_points, slice_analysis}.py`	clean	already library-first per `decisions/library_imports.md` lines 39-53
Phase 3	`src/eval/schemas.py`	clean	pydantic models for project-specific row schemas; `eval_toolkit.schemas/` is JSON schemas for artifact validation — different concern
Phase 5 (planned)	model card / Quarto / WRITEUP / HF Hub upload	clean	no upstream overlap

The damning detail for hand-roll #1: eval_toolkit.splits.SourceDisjointKFoldSplitter’s own docstring reads “Generalizes the source-disjoint split pattern from prompt-injection-sdd” — meaning the upstream primitive was abstracted FROM a predecessor of THIS project, then we didn’t use it here. The strongest possible “should be upstream” signal.

Project-wide invariant codified

Two memory files locked the relevant rules:

library-first-is-project-wide-invariant — audit eval-toolkit + runpod-deploy + research_toolkit at EVERY module-design step, file upstream issues for gaps before workaround.
no-orphaned-code-during-refactor — when refactoring to consume an upstream primitive, delete the now-unreachable local helpers in the SAME commit; never ship a transition commit with both paths live.

Both apply going forward to Phase 4 + Phase 5 work.

Decision

Refactor sequence (4 commits)

Each refactor lands as its own commit per the established cadence (Phase 1 + Phase 2 + Phase 3 all used per-commit module landings). Each commit (a) consumes the upstream primitive, (b) deletes the now-unreachable local helpers + unused imports in the same commit, (c) preserves locked methodology constants bit-for-bit, (d) re-runs the relevant invariants + fixture regeneration.

Commit	Deliverable	Invariants verified
1 (this)	ADR-047 + SPEC_SHEET §3.5 audit-findings row + SUBMISSION_AUDIT regen	n/a
2	`src/data/splits.py` refactor — `SourceDisjointKFoldSplitter` consumer; round-robin source-bucket logic deleted in-commit; project glue for nested seed split + stratified 80/20 train/val preserved per ADR-016 Q2	`test_splits_lodo_partition_disjoint` + `test_splits_per_seed_stratification`
3	`src/data/dedup.py` refactor — `near_dedup` + `cross_dedup` + `EmbeddingCosineStrategy(embedder=compute_embeddings)` consumers; local `pairwise_cosines` + `_greedy_first_occurrence_mask` deleted in-commit; project-owned embedder glue (`get_encoder` + `compute_embeddings` + `encoder_revision_sha`) preserved	`test_dedup_within_source_threshold_applied` + `test_dedup_cross_source_lmsys_priority` + dedup-holdout calibration re-run cleanly
4	`src/data/audit.py` refactor — `compute_leakage_report` consumes `run_leakage_checks`; `compute_contamination_scan` consumes `EmbeddingCosineStrategy.pairs_across`; local `_per_row_max_cosine_to_ref` deleted in-commit; `evals/leakage_report.json` schema migration to `LeakageReport`-derived JSON (or adapter wrapper preserving downstream consumers)	`test_leakage_report_persisted` + `test_contamination_scan_persisted` + 3 pre-existing data-pipeline invariants

After Commit 4 lands + all data-pipeline invariants pass + make data-prepare regenerates 36 split parquets + dedup-holdout calibration re-runs cleanly, ADR-046 (Phase 4 implementation bundle) lands and Phase 4 Commit 1 begins.

Locked constants preserved bit-for-bit

Constant	Value	Source	Refactor preserves
Within-source dedup threshold	0.80	ADR-016 Q4	`near_dedup(threshold=0.80)`
Cross-source dedup threshold	0.80	ADR-016 Q4-A	`cross_dedup(threshold=0.80)`
Train↔︎test leakage cleanup threshold	0.85	ADR-043	`NearDuplicateCheck(threshold=0.85)`
Benign contamination threshold	0.85	ADR-041 Q6	`pairs_across(...) >= 0.85`
Sentence-transformer model	`all-MiniLM-L6-v2`	ADR-016 Q4	preserved in `get_encoder()` (project-owned)
LODO k folds	4	ADR-016 Q2	`SourceDisjointKFoldSplitter(k=4)`
Seeds	(42, 43, 44)	ADR-006	project seed-loop glue
Val fraction	0.20	ADR-016 Q2	project `train_test_split(test_size=0.20)`

Project-owned glue preserved

src/data/dedup.py::get_encoder() — sentence-transformer factory (project owns model choice + revision SHA per encoder_revision_sha())
src/data/dedup.py::compute_embeddings(texts, batch_size=64) — batched encoding wrapper (project owns batch size + normalization choice)
src/data/dedup.py::encoder_revision_sha() — pinning utility (project owns the pin)
src/data/splits.py::FoldSeedSplit dataclass — project-specific (fold_id, seed, held_out_source, train, val, test) tuple
src/data/splits.py::apply_leakage_cleanup — project orchestrator wiring drop_train_test_leakage (post-refactor: cross_dedup) into the splits pipeline per ADR-043
src/data/audit.py::compute_data_audit — per-source class-balance + length-distribution audit (project-specific reporting layout)
Per-source aggregation glue in compute_contamination_scan — flag rate per source + percentile reporting (project-specific layout for evals/contamination_scan.json)

Upstream contributions

Two filed at audit close:

brandon-behring/eval-toolkit#18 — wire the project’s 50-pair golden dedup-holdout dataset (data/dedup_holdout.jsonl per ADR-041 Q5 + ADR-042) into eval-toolkit’s CI test fixtures; closes the “separately tested there with golden eval data sets” gap directly.
brandon-behring/eval-toolkit#19 — add cookbook docs covering the 3 compositional patterns that caused the original hand-rolls (nested seed-split composition over SourceDisjointKFoldSplitter; callable-embedder pattern for EmbeddingCosineStrategy; pairs_across for cross-corpus contamination scan).

Consequences

Positive:

Library-first invariant honored retroactively; project audit story strengthens (reviewer-facing — explicit ADR documenting the audit + remediation + upstream contributions; no silent technical debt).
~50 percent of src/data/ LoC removed (orphaned helpers deleted per no-orphaned-code discipline); maintenance surface shrinks correspondingly.
Upstream eval-toolkit benefits from the golden dedup-holdout contribution (issue #18) — every future eval-toolkit consumer gains regression coverage against this project’s adversarial corpus.
The 3-pattern cookbook docs contribution (issue #19) reduces the likelihood of similar hand-rolls in other downstream consumers + in this project’s future modules.
Phase 4 + Phase 5 work proceeds with cleaner library-first hygiene as the baseline expectation.

Negative / cost:

Phase 4 start delayed by 4 commits (~half-day to a day of focused work for the refactors + invariant re-runs + fixture regeneration).
evals/leakage_report.json schema migration creates a one-time discontinuity in the audit artifact format; downstream consumers (any analysis script reading the JSON) need updating in the same commit.
Dedup-holdout calibration re-runs may surface tiny numerical drift if near_dedup’s greedy-scan order differs imperceptibly from local impl (e.g., tie-breaking); rare but not impossible.
make data-prepare re-runs the full 11-source data pipeline (~10 min wall-clock per ADR-041 Q7); per-fold split parquets regenerate; downstream Phase 2 trained-rung parquets are unaffected (they consume the splits + retrain triggers stay manual).

Neutral:

Methodology semantics unchanged (every locked constant + every threshold + every protocol preserved bit-for-bit); no methodology ADR superseded.
decisions/library_imports.md ledger expands with the new eval-toolkit primitive uses (planned post-refactor table-update).
SUBMISSION_AUDIT.md regenerates clean; CI gate stays green.
ADR-046 (Phase 4 implementation bundle) writing is briefly deferred but content unchanged.

Alternatives Considered

File issues + write ADR-047 (defer refactor) + proceed to Phase 4 immediately: Skip the refactor; ADR-047 documents “acknowledged + scheduled post-submission per v1.0.x”. Rejected because: lower discipline; reviewer-facing audit story weaker; by the time submission lands, carryforward debt may be larger; the user explicitly chose Option A (refactor now) when presented with the four-way alternative.
File issues only + proceed to Phase 4 (skip ADR for now): Just file the upstream issues to mark the gap; no ADR. Rejected because: no immutable record of the audit + decision; future-you may not remember the gap exists; weakest discipline option; violates the SPEC_GREENFIELD anti-pattern “Adding a methodology component without an ADR” (the refactor is a methodology-adjacent change even though it preserves methodology semantics).
Pause Phase 4 entirely; full Phase 1 audit + refactor sprint (3-5 days): Treat this as a Phase 1 errata sprint; refactor + comprehensive 4th-confirmation + dedup-holdout calibration re-run against upstream primitives + extra scoping discussions. Rejected because: overscope — the audit findings are already confirmed at a level sufficient for the 4-commit refactor; a multi-day sprint would lose Phase 4 momentum without proportional benefit.
Refactor only the most damning hand-roll (make_splits) + defer the rest: Skip dedup + leakage + contamination_scan refactors. Rejected because: arbitrary; if the library-first invariant applies, it applies to all 4 hand-rolls; partial refactor leaves the inconsistency the audit was meant to close.
In-place rewrite of ADR-041 to embed the library-first language retroactively: Edit ADR-041 to say “uses eval_toolkit primitives” as if it always had. Rejected because: violates the ADR immutability constraint per decisions/README.md (“ADRs are immutable; supersede via new ADR”); the right pattern is exactly what this ADR does — a new ADR documenting the audit + remediation + carryforward.

References

decisions/ADR-016-data-design-bundle.md — methodology source for LODO k=4 plus seed plus threshold constants
decisions/ADR-041-phase-1-data-implementation-bundle.md — Phase 1 implementation bundle being carryforward-refactored
decisions/ADR-042-llm-prelabel-dedup-holdout-bootstrap.md — dedup-holdout dataset that issue #18 contributes upstream
decisions/ADR-043-post-split-cross-source-leakage-cleanup.md — leakage cleanup threshold = 0.85 source
decisions/ADR-036-library-version-pins-tag-pin-plus-freeze.md — eval-toolkit v0.31.0 pin
decisions/library_imports.md — discipline ledger (positive evidence of upstream consumption; expands post-refactor)
decisions/upstream_issues.md — discipline ledger (gap-filing record; entries #14-19 listed)
https://github.com/brandon-behring/eval-toolkit/issues/18 — golden dedup-holdout test contribution
https://github.com/brandon-behring/eval-toolkit/issues/19 — 3-pattern cookbook docs contribution
https://github.com/brandon-behring/eval-toolkit/blob/v0.31.0/src/eval_toolkit/splits.py — SourceDisjointKFoldSplitter source (docstring cites the predecessor abstraction)
https://github.com/brandon-behring/eval-toolkit/blob/v0.31.0/src/eval_toolkit/text_dedup.py — near_dedup + cross_dedup + EmbeddingCosineStrategy + SimilarityStrategy.pairs_across source
https://github.com/brandon-behring/eval-toolkit/blob/v0.31.0/src/eval_toolkit/leakage.py — run_leakage_checks + ExactDuplicateCheck + NearDuplicateCheck + CrossSplitLeakageCheck source

Transcript

See transcripts/2026-05-16__phase-4-entry-plus-phase-1-library-first-refactor.md for the conversation that led to this decision (the Phase 4 entry walkthrough Q6 user-reaffirmation + retroactive audit pass + four-way option ratification).