LLM-judge pre-label as Q5 holdout-labeling bootstrap with human override
ADR-042: LLM-judge pre-label as Q5 holdout-labeling bootstrap with human override
Status
Accepted (2026-05-16).
Context
Phase 1 Commit 3 landed the dedup pipeline per ADR-016 Q4 + ADR-041 Q5 — including the 50-pair stratified-cosine-band holdout builder (scripts/build_dedup_holdout.py) and the calibration script (scripts/calibrate_dedup.py). ADR-041 Q5 locked option A (Brandon hand-labels each of 50 candidate pairs) and explicitly rejected option D (LLM-judge labels 100 pairs plus Brandon spot-checks 20) on the grounds that the judge prior could contaminate the calibration and that near-duplicate judgment quality on LLM judges is unknown.
Operationally, however, hand-labeling 50 pairs of prompt-injection texts via cold-start visual inspection imposes approximately 1-2 hours of researcher time. User flagged a pragmatic acceleration — let an LLM judge produce a preliminary label per pair so the researcher’s hand-examination becomes an override pass rather than a from-scratch pass.
The methodological tension to resolve: how to accelerate without sacrificing the rigor that ADR-041 Q5 locked. The answer landed on a layered-provenance design — both labels persist; humans win where present; calibration discloses the labeling provenance fraction.
Decision
Bootstrap workflow
build_dedup_holdout.py -> llm_prelabel_dedup_holdout.py -> Brandon reviews -> calibrate_dedup.py
(generates 50 candidates) (sets llm_judge_label) (sets human_label) (resolves + persists)
Per-pair fields persisted to data/dedup_holdout.jsonl
| Field | Type | Set by | Notes |
|---|---|---|---|
pair_id |
int | builder | unique ID |
text_a, text_b, idx_a, idx_b, source_a, source_b, cosine, band |
various | builder | candidate-pair payload |
llm_judge_label |
bool or null | prelabel script | gpt-4o-2024-08-06 verdict |
llm_judge_reasoning |
str | prelabel script | one-sentence rationale (max ~250 chars) |
llm_judge_model |
str | prelabel script | snapshot identifier (gpt-4o-2024-08-06) |
human_label |
bool or null | Brandon (manual) | null = not yet reviewed |
true_duplicate |
bool | calibrate script | resolved via priority (see below) |
Label-resolution rule (calibrate-time)
if human_label is not None:
true_duplicate = human_label
provenance = "human"
elif llm_judge_label is not None:
true_duplicate = llm_judge_label
provenance = "llm_judge"
else:
HoldoutNotLabeledError
Calibration JSON disclosure (evals/dedup_calibration.json)
The calibration JSON gains a top-level label_provenance block:
{
"label_provenance": {
"human_verified_count": 12,
"llm_judge_only_count": 38,
"human_verified_pct": 24.0,
"llm_judge_model": "gpt-4o-2024-08-06"
}
}This block lets reviewers read off exactly which fraction of the 50 labels carries human verification vs LLM-judge only. The eventual goal is human_verified_pct = 100.0 once Brandon’s hand-examination completes; the bootstrap state permits a calibration run at lower verification (preliminary; flagged in the calibration JSON).
Judge model + parameters
- Model —
gpt-4o-2024-08-06(matches ADR-018 LLM-as-rater snapshot for consistency across the LLM-judge usage surface). - Temperature — 0.0 (deterministic judgments).
- Response format — JSON object
{is_duplicate: bool, reasoning: str}. - max_tokens — 300 (typical responses <100 tokens).
- Retry policy — exponential backoff 3 attempts; failures recorded with
llm_judge_label: nullplusFAILED:reasoning text.
Cost envelope
Approximately $0.0025 per pair at 2026-05 OpenAI pricing for gpt-4o-2024-08-06; 50 pairs ~= $0.12 total. Well below the ADR-016 compute envelope ($60-115).
Key invariant — methodology is still hand-labeled
The ground-truth labeling methodology stays Brandon-hand-labeled — the LLM bootstrap is a UI affordance that fills in initial values for the researcher’s review. Brandon’s human_label is the authoritative ground truth wherever set. The calibration JSON’s human_verified_pct discloses to the reviewer exactly how much of the calibration is human-verified.
Consequences
Positive:
- Acceleration — researcher labeling time approximately halves; LLM-pre-label makes each pair a fast review-and-confirm rather than a from-scratch decision.
- Provenance preserved —
llm_judge_label+human_labelpersist as separate fields; calibration JSON discloses the split. - ADR-018 LLM-judge usage gets a small dogfood — same model snapshot used for headline rating gets exercised on the dedup-judgment task; any quality issues surface early.
- Re-runnable — if Brandon overrides many LLM labels, the calibration JSON’s
human_verified_pctrises; calibration regenerates trivially. - Compatible with ADR-041 Q5 strict option A — when
human_verified_pct = 100, the LLM-pre-label is effectively just initial UI state that was always overwritten; no methodology compromise.
Negative / cost:
- Additional dep —
openai>=1.50enterspyproject.toml; openai SDK adds approximately 5MB. Acceptable. - Cost — approximately $0.12 for 50 pairs; trivial vs total budget.
- Risk of researcher confirmation bias — if the LLM’s preliminary label biases Brandon’s judgment, the hand-examination becomes less independent. Mitigation —
llm_judge_reasoningis shown alongside the label so Brandon evaluates the reasoning + can spot weak LLM justifications. - Risk of low
human_verified_pctat submission — if Brandon does not hand-examine all 50 before submission, the calibration JSON ships with disclosurehuman_verified_pct < 100and reviewer may flag. Mitigation — submission-readiness gate 1 (zero[OPEN]in SPEC_SHEET) is checked at v1.0.0 tag; SPEC_SHEET §3.5 status row tracks Phase 1 Commit 3 close including holdout-labeling completeness.
Neutral:
- ADR-041 Q5 remains the authoritative locked methodology; ADR-042 is a tooling refinement that operates within Q5.
- The 0.80 threshold + the MiniLM encoder + the stratified-cosine-band sampling structure are unchanged.
Alternatives Considered
- Pure ADR-041 Q5 option A (no LLM bootstrap): rejected for this iteration as the user explicitly requested LLM-pre-label acceleration; ADR-041 Q5’s methodology is preserved via the human-override + provenance-disclosure design.
- Pure LLM-judge labeling (no human override) per ADR-041 Q5 option D: rejected — same rationale ADR-041 Q5 used; pure LLM labels carry judge-prior contamination risk.
- Human-only labeling with LLM as second opinion (inverse provenance priority): rejected — would require Brandon to label from scratch first; defeats the acceleration purpose.
- Different judge model (gpt-4o-mini-2024-07-18, o1-mini): rejected for this iteration — gpt-4o-2024-08-06 matches the ADR-018 snapshot for consistency. Cheaper or smarter alternatives can be explored as a future-work axis if the judge quality surfaces issues.
References
- ADR-016 (data design bundle — dedup encoder + threshold locks)
- ADR-018 (LLM-judge and reference-rung pins —
gpt-4o-2024-08-06snapshot) - ADR-035 (secrets management —
OPENAI_API_KEYenv-var discovery) - ADR-041 (Phase 1 implementation bundle — Q5 stratified-cosine-band holdout)
- OpenAI gpt-4o model card — https://platform.openai.com/docs/models/gpt-4o
Transcript
See transcripts/2026-05-16__phase-1-implementation.md for the conversation that led to this decision.