Phase 1 library-first carryforward refactor — audit findings + remediation plan (splits + dedup + leakage_report + contamination_scan migration to eval-toolkit primitives; orphaned code removed in-commit)
ADR-047: Phase 1 library-first carryforward refactor
Status
Accepted (2026-05-16). Does not supersede ADR-041 (Phase 1 implementation bundle), ADR-042 (dedup-holdout protocol), or ADR-043 (post-split leakage cleanup) — those locked the methodology; this ADR locks the library-first carryforward refactor of the implementation while preserving every methodology constant bit-for-bit.
Context
How the audit was triggered
Phase 4 entry walkthrough generated a 7-question /exploring-options ratify session (per Phase 0 + Phase 2 + Phase 3 precedent). Q6 (figures slate) initially proposed hand-rolled matplotlib for the 7-figure slate; the user pushed back — eval-toolkit ships plotting.py with plot_pr_curve, plot_reliability_diagram, plot_bootstrap_distribution, plot_metric_bars, plot_lift_ci, plot_score_histograms, plot_confusion_matrix_grid, save_figure (PNG/PDF/SVG with provenance iTXt), set_plot_style + PALETTE — and the project’s library-first invariant requires consuming these before any local glue. Four upstream gaps surfaced (plot_roc_curve + plot_pareto_frontier + plot_slice_metric_heatmap + paired_bootstrap_diff n_jobs kwarg) and were filed as issues #14 + #15 + #16 + #17 on brandon-behring/eval-toolkit before any local Phase 4 figures-slate glue ships.
User then reaffirmed the rule as project-wide — “this is true throughout everything in this project so we want to make sure are not handrolling tools that can be better handled by my eval-toolkit and separately tested there with golden eval data sets etc” — and explicitly extended the audit retroactively to Phases 1 + 2 + 3 (already-shipped) plus prospectively to Phase 5 (planned).
Audit method
For each src/ subpackage, the audit (a) inventoried the public API surface (every def + class at module top-level), (b) inventoried the corresponding eval-toolkit module’s public API surface (where one exists with overlapping module-name semantics — src/data/dedup.py ↔︎ eval_toolkit.text_dedup; src/data/splits.py ↔︎ eval_toolkit.splits; src/data/audit.py ↔︎ eval_toolkit.leakage; etc.), (c) read both signatures + docstrings + reference impls + confirmed semantic match, (d) classified each local symbol as hand-roll (upstream primitive fits cleanly), partial overlap (upstream has machinery, project glue composes around it), or project-specific (no upstream equivalent; legitimate local code).
Audit results
| Phase | Module | Classification | Refactor target |
|---|---|---|---|
| Phase 1 | src/data/splits.py::make_splits |
hand-roll | SourceDisjointKFoldSplitter |
| Phase 1 | src/data/dedup.py (3 fns) |
hand-roll | near_dedup + cross_dedup + EmbeddingCosineStrategy(embedder=compute_embeddings) |
| Phase 1 | src/data/audit.py::compute_leakage_report |
hand-roll | run_leakage_checks([ExactDuplicateCheck(), NearDuplicateCheck(...), CrossSplitLeakageCheck()]) |
| Phase 1 | src/data/audit.py::compute_contamination_scan |
partial overlap | EmbeddingCosineStrategy(embedder=compute_embeddings).pairs_across(query, reference, k=1) + project per-source aggregation glue |
| Phase 1 | src/data/manifest_validation.py |
project-specific | config-time YAML schema for source_manifest.yaml — eval_toolkit.manifest is runtime artifact provenance — different concern |
| Phase 1 | src/data/loaders.py |
project-specific | per-source bespoke normalizers (HackAPrompt + LMSYS-Chat-1M + …) — HFDatasetsLoader is generic — project glue legitimate |
| Phase 1 | src/data/templates.py |
project-specific | HackAPrompt template extraction — no upstream equivalent |
| Phase 2 | src/training/*.py |
clean | HF Transformers + LoRA + sklearn — eval-toolkit doesn’t compete here |
| Phase 3 | src/scoring/*.py |
clean | LLM-judge wrappers + ProtectAI inference — no upstream overlap |
| Phase 3 | src/eval/{calibration_battery, operating_points, slice_analysis}.py |
clean | already library-first per decisions/library_imports.md lines 39-53 |
| Phase 3 | src/eval/schemas.py |
clean | pydantic models for project-specific row schemas; eval_toolkit.schemas/ is JSON schemas for artifact validation — different concern |
| Phase 5 (planned) | model card / Quarto / WRITEUP / HF Hub upload | clean | no upstream overlap |
The damning detail for hand-roll #1: eval_toolkit.splits.SourceDisjointKFoldSplitter’s own docstring reads “Generalizes the source-disjoint split pattern from prompt-injection-sdd” — meaning the upstream primitive was abstracted FROM a predecessor of THIS project, then we didn’t use it here. The strongest possible “should be upstream” signal.
Project-wide invariant codified
Two memory files locked the relevant rules:
library-first-is-project-wide-invariant— audit eval-toolkit + runpod-deploy + research_toolkit at EVERY module-design step, file upstream issues for gaps before workaround.no-orphaned-code-during-refactor— when refactoring to consume an upstream primitive, delete the now-unreachable local helpers in the SAME commit; never ship a transition commit with both paths live.
Both apply going forward to Phase 4 + Phase 5 work.
Decision
Refactor sequence (4 commits)
Each refactor lands as its own commit per the established cadence (Phase 1 + Phase 2 + Phase 3 all used per-commit module landings). Each commit (a) consumes the upstream primitive, (b) deletes the now-unreachable local helpers + unused imports in the same commit, (c) preserves locked methodology constants bit-for-bit, (d) re-runs the relevant invariants + fixture regeneration.
| Commit | Deliverable | Invariants verified |
|---|---|---|
| 1 (this) | ADR-047 + SPEC_SHEET §3.5 audit-findings row + SUBMISSION_AUDIT regen | n/a |
| 2 | src/data/splits.py refactor — SourceDisjointKFoldSplitter consumer; round-robin source-bucket logic deleted in-commit; project glue for nested seed split + stratified 80/20 train/val preserved per ADR-016 Q2 |
test_splits_lodo_partition_disjoint + test_splits_per_seed_stratification |
| 3 | src/data/dedup.py refactor — near_dedup + cross_dedup + EmbeddingCosineStrategy(embedder=compute_embeddings) consumers; local pairwise_cosines + _greedy_first_occurrence_mask deleted in-commit; project-owned embedder glue (get_encoder + compute_embeddings + encoder_revision_sha) preserved |
test_dedup_within_source_threshold_applied + test_dedup_cross_source_lmsys_priority + dedup-holdout calibration re-run cleanly |
| 4 | src/data/audit.py refactor — compute_leakage_report consumes run_leakage_checks; compute_contamination_scan consumes EmbeddingCosineStrategy.pairs_across; local _per_row_max_cosine_to_ref deleted in-commit; evals/leakage_report.json schema migration to LeakageReport-derived JSON (or adapter wrapper preserving downstream consumers) |
test_leakage_report_persisted + test_contamination_scan_persisted + 3 pre-existing data-pipeline invariants |
After Commit 4 lands + all data-pipeline invariants pass + make data-prepare regenerates 36 split parquets + dedup-holdout calibration re-runs cleanly, ADR-046 (Phase 4 implementation bundle) lands and Phase 4 Commit 1 begins.
Locked constants preserved bit-for-bit
| Constant | Value | Source | Refactor preserves |
|---|---|---|---|
| Within-source dedup threshold | 0.80 | ADR-016 Q4 | near_dedup(threshold=0.80) |
| Cross-source dedup threshold | 0.80 | ADR-016 Q4-A | cross_dedup(threshold=0.80) |
| Train↔︎test leakage cleanup threshold | 0.85 | ADR-043 | NearDuplicateCheck(threshold=0.85) |
| Benign contamination threshold | 0.85 | ADR-041 Q6 | pairs_across(...) >= 0.85 |
| Sentence-transformer model | all-MiniLM-L6-v2 |
ADR-016 Q4 | preserved in get_encoder() (project-owned) |
| LODO k folds | 4 | ADR-016 Q2 | SourceDisjointKFoldSplitter(k=4) |
| Seeds | (42, 43, 44) | ADR-006 | project seed-loop glue |
| Val fraction | 0.20 | ADR-016 Q2 | project train_test_split(test_size=0.20) |
Project-owned glue preserved
src/data/dedup.py::get_encoder()— sentence-transformer factory (project owns model choice + revision SHA perencoder_revision_sha())src/data/dedup.py::compute_embeddings(texts, batch_size=64)— batched encoding wrapper (project owns batch size + normalization choice)src/data/dedup.py::encoder_revision_sha()— pinning utility (project owns the pin)src/data/splits.py::FoldSeedSplitdataclass — project-specific (fold_id, seed, held_out_source, train, val, test) tuplesrc/data/splits.py::apply_leakage_cleanup— project orchestrator wiringdrop_train_test_leakage(post-refactor:cross_dedup) into the splits pipeline per ADR-043src/data/audit.py::compute_data_audit— per-source class-balance + length-distribution audit (project-specific reporting layout)- Per-source aggregation glue in
compute_contamination_scan— flag rate per source + percentile reporting (project-specific layout forevals/contamination_scan.json)
Upstream contributions
Two filed at audit close:
brandon-behring/eval-toolkit#18— wire the project’s 50-pair golden dedup-holdout dataset (data/dedup_holdout.jsonlper ADR-041 Q5 + ADR-042) into eval-toolkit’s CI test fixtures; closes the “separately tested there with golden eval data sets” gap directly.brandon-behring/eval-toolkit#19— add cookbook docs covering the 3 compositional patterns that caused the original hand-rolls (nested seed-split composition overSourceDisjointKFoldSplitter; callable-embedder pattern forEmbeddingCosineStrategy;pairs_acrossfor cross-corpus contamination scan).
Consequences
Positive:
- Library-first invariant honored retroactively; project audit story strengthens (reviewer-facing — explicit ADR documenting the audit + remediation + upstream contributions; no silent technical debt).
- ~50 percent of
src/data/LoC removed (orphaned helpers deleted per no-orphaned-code discipline); maintenance surface shrinks correspondingly. - Upstream eval-toolkit benefits from the golden dedup-holdout contribution (issue #18) — every future eval-toolkit consumer gains regression coverage against this project’s adversarial corpus.
- The 3-pattern cookbook docs contribution (issue #19) reduces the likelihood of similar hand-rolls in other downstream consumers + in this project’s future modules.
- Phase 4 + Phase 5 work proceeds with cleaner library-first hygiene as the baseline expectation.
Negative / cost:
- Phase 4 start delayed by 4 commits (~half-day to a day of focused work for the refactors + invariant re-runs + fixture regeneration).
evals/leakage_report.jsonschema migration creates a one-time discontinuity in the audit artifact format; downstream consumers (any analysis script reading the JSON) need updating in the same commit.- Dedup-holdout calibration re-runs may surface tiny numerical drift if
near_dedup’s greedy-scan order differs imperceptibly from local impl (e.g., tie-breaking); rare but not impossible. make data-preparere-runs the full 11-source data pipeline (~10 min wall-clock per ADR-041 Q7); per-fold split parquets regenerate; downstream Phase 2 trained-rung parquets are unaffected (they consume the splits + retrain triggers stay manual).
Neutral:
- Methodology semantics unchanged (every locked constant + every threshold + every protocol preserved bit-for-bit); no methodology ADR superseded.
decisions/library_imports.mdledger expands with the new eval-toolkit primitive uses (planned post-refactor table-update).SUBMISSION_AUDIT.mdregenerates clean; CI gate stays green.- ADR-046 (Phase 4 implementation bundle) writing is briefly deferred but content unchanged.
Alternatives Considered
- File issues + write ADR-047 (defer refactor) + proceed to Phase 4 immediately: Skip the refactor; ADR-047 documents “acknowledged + scheduled post-submission per
v1.0.x”. Rejected because: lower discipline; reviewer-facing audit story weaker; by the time submission lands, carryforward debt may be larger; the user explicitly chose Option A (refactor now) when presented with the four-way alternative. - File issues only + proceed to Phase 4 (skip ADR for now): Just file the upstream issues to mark the gap; no ADR. Rejected because: no immutable record of the audit + decision; future-you may not remember the gap exists; weakest discipline option; violates the SPEC_GREENFIELD anti-pattern “Adding a methodology component without an ADR” (the refactor is a methodology-adjacent change even though it preserves methodology semantics).
- Pause Phase 4 entirely; full Phase 1 audit + refactor sprint (3-5 days): Treat this as a Phase 1 errata sprint; refactor + comprehensive 4th-confirmation + dedup-holdout calibration re-run against upstream primitives + extra scoping discussions. Rejected because: overscope — the audit findings are already confirmed at a level sufficient for the 4-commit refactor; a multi-day sprint would lose Phase 4 momentum without proportional benefit.
- Refactor only the most damning hand-roll (
make_splits) + defer the rest: Skip dedup + leakage + contamination_scan refactors. Rejected because: arbitrary; if the library-first invariant applies, it applies to all 4 hand-rolls; partial refactor leaves the inconsistency the audit was meant to close. - In-place rewrite of ADR-041 to embed the library-first language retroactively: Edit ADR-041 to say “uses eval_toolkit primitives” as if it always had. Rejected because: violates the ADR immutability constraint per
decisions/README.md(“ADRs are immutable; supersede via new ADR”); the right pattern is exactly what this ADR does — a new ADR documenting the audit + remediation + carryforward.
References
decisions/ADR-016-data-design-bundle.md— methodology source for LODO k=4 plus seed plus threshold constantsdecisions/ADR-041-phase-1-data-implementation-bundle.md— Phase 1 implementation bundle being carryforward-refactoreddecisions/ADR-042-llm-prelabel-dedup-holdout-bootstrap.md— dedup-holdout dataset that issue #18 contributes upstreamdecisions/ADR-043-post-split-cross-source-leakage-cleanup.md— leakage cleanup threshold = 0.85 sourcedecisions/ADR-036-library-version-pins-tag-pin-plus-freeze.md— eval-toolkitv0.31.0pindecisions/library_imports.md— discipline ledger (positive evidence of upstream consumption; expands post-refactor)decisions/upstream_issues.md— discipline ledger (gap-filing record; entries #14-19 listed)https://github.com/brandon-behring/eval-toolkit/issues/18— golden dedup-holdout test contributionhttps://github.com/brandon-behring/eval-toolkit/issues/19— 3-pattern cookbook docs contributionhttps://github.com/brandon-behring/eval-toolkit/blob/v0.31.0/src/eval_toolkit/splits.py—SourceDisjointKFoldSplittersource (docstring cites the predecessor abstraction)https://github.com/brandon-behring/eval-toolkit/blob/v0.31.0/src/eval_toolkit/text_dedup.py—near_dedup+cross_dedup+EmbeddingCosineStrategy+SimilarityStrategy.pairs_acrosssourcehttps://github.com/brandon-behring/eval-toolkit/blob/v0.31.0/src/eval_toolkit/leakage.py—run_leakage_checks+ExactDuplicateCheck+NearDuplicateCheck+CrossSplitLeakageChecksource
Transcript
See transcripts/2026-05-16__phase-4-entry-plus-phase-1-library-first-refactor.md for the conversation that led to this decision (the Phase 4 entry walkthrough Q6 user-reaffirmation + retroactive audit pass + four-way option ratification).