Worked example: nested seed-split composition over SourceDisjointKFoldSplitter
What this shows. The canonical OOD-CV protocol per Bayle 2020 + matched-budget per HELM lineage: LODO k-fold splits × multiple seeds × within-fold stratified train/val. Each toolkit primitive does one thing; the composition is project glue.
Runtime: ~1 s. Pure-numpy/sklearn core; no optional deps.
Why this composition
SourceDisjointKFoldSplitter is single-seed: its seed= only shuffles
the source-bucket order, not the per-row split. For multi-seed protocols
(standard for variance estimation across re-runs), loop over the seeds
outside the splitter and, for each seed, sub-split each fold’s train
partition into train and val with a stratified
sklearn.model_selection.train_test_split (random_state=seed).
This was a v0.34.0-era cookbook addition (closes #19). The pattern is not invented by the toolkit; the splitter API is intentionally minimal and composes with stdlib + sklearn for the loops that wrap it.
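The division of labour can be checked without the toolkit. The sketch below is an illustration only: it uses sklearn's LeaveOneGroupOut as a stand-in for SourceDisjointKFoldSplitter (an assumption; the toolkit's internals are not shown on this page) and the same 4-source layout as the example. It confirms that the fold partition has no seed dependence, while the per-seed train/val sub-split does vary.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, train_test_split

rng = np.random.default_rng(0)
sources = np.repeat(["src_a", "src_b", "src_c", "src_d"], 50)
labels = rng.integers(0, 2, size=sources.size)
rows = np.arange(sources.size)

# Stand-in for the source-disjoint splitter: one held-out source per fold.
folds = list(LeaveOneGroupOut().split(rows, labels, groups=sources))

val_sets = {}
for fold_idx, (train_rows, test_rows) in enumerate(folds):
    # Fold level: train and test never share a source.
    assert set(sources[train_rows]).isdisjoint(sources[test_rows])
    for seed in (42, 43, 44):
        # Seed level: only the train/val sub-split depends on the seed.
        _, val_rows = train_test_split(
            train_rows,
            stratify=labels[train_rows],
            test_size=0.20,
            random_state=seed,
        )
        val_sets[(fold_idx, seed)] = frozenset(val_rows.tolist())

# Same fold, different seeds: the validation rows differ.
seeds_differ = val_sets[(0, 42)] != val_sets[(0, 43)]
```

Because the fold boundaries are fixed by the source column, any spread in downstream metrics across seeds within one fold is attributable to the train/val sub-split alone.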
Setup
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from eval_toolkit import EvalSlice, SourceDisjointKFoldSplitter, set_global_seeds
set_global_seeds(42)
Synthetic data: 4 sources, balanced labels
rng = np.random.default_rng(42)
n_per_source = 50
df = pd.DataFrame({
    "text": [f"row_{i}" for i in range(4 * n_per_source)],
    "label": rng.integers(0, 2, size=4 * n_per_source),
    "source": np.repeat(["src_a", "src_b", "src_c", "src_d"], n_per_source),
})
parent = EvalSlice(df=df, name="full_corpus")
LODO k-fold × multi-seed × stratified train/val
splitter = SourceDisjointKFoldSplitter(source_col="source", k=4, seed=42)
seeds = (42, 43, 44)
split_count = 0
for fold_idx, fold_dict in enumerate(splitter.iter_folds(parent)):
    train_full = fold_dict["train"]  # 3-source train partition (EvalSlice)
    test_slice = fold_dict["test"]  # held-out 1-source test partition
    for seed in seeds:
        # Project glue: stratified train/val split per seed.
        # The splitter doesn't own this; sklearn does, and that's intentional.
        # train_idx / val_idx are positional indices into train_full.df.
        train_idx, val_idx = train_test_split(
            np.arange(len(train_full.df)),
            stratify=train_full.df["label"],
            test_size=0.20,
            random_state=seed,
        )
        # 4 folds × 3 seeds × {train, val, test} = 36 splits
        split_count += 3
assert split_count == 4 * 3 * 3  # 36 splits total
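The loop above discards train_idx and val_idx after counting. A minimal sketch of keeping them follows, assuming only pandas and sklearn: build_manifest is a hypothetical helper (not part of eval_toolkit), and a plain boolean mask on the source column stands in for the splitter so the block runs on its own. The indices returned by train_test_split are positional, so they are applied with .iloc.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def build_manifest(df, seeds=(42, 43, 44), source_col="source", label_col="label"):
    """Hypothetical helper: one record per (held-out source, seed)."""
    manifest = []
    for held_out in sorted(df[source_col].unique()):
        is_test = df[source_col] == held_out  # stand-in for the splitter
        train_full, test = df[~is_test], df[is_test]
        for seed in seeds:
            # Positional indices into train_full, hence .iloc below.
            train_pos, val_pos = train_test_split(
                np.arange(len(train_full)),
                stratify=train_full[label_col],
                test_size=0.20,
                random_state=seed,
            )
            manifest.append({
                "held_out": held_out,
                "seed": seed,
                "train": train_full.iloc[train_pos],
                "val": train_full.iloc[val_pos],
                "test": test,
            })
    return manifest


rng = np.random.default_rng(42)
df = pd.DataFrame({
    "label": rng.integers(0, 2, size=200),
    "source": np.repeat(["src_a", "src_b", "src_c", "src_d"], 50),
})
manifest = build_manifest(df)

for rec in manifest:
    # The three partitions are disjoint and together cover every row.
    parts = [set(rec[name].index) for name in ("train", "val", "test")]
    assert sum(len(p) for p in parts) == len(df) == len(set().union(*parts))
    # The held-out source never leaks into train or val.
    assert rec["held_out"] not in set(rec["train"]["source"]) | set(rec["val"]["source"])
```

Each record carries its seed and held-out source, so per-seed scores can later be grouped by fold when estimating variance.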
Common pitfalls
Don’t pass seed= to the splitter for variance estimation. The splitter seed only shuffles source-bucket order; it does NOT produce different per-row partitions. Each fold’s row membership is deterministic given (source_col, k). Variance comes from the per-seed train/val split, not from the fold partition.
Stratify on label, not source. Source is already partitioned at fold level; stratifying on source within the train partition would bias against the LODO design intent.
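What stratifying on label buys can be seen on a deliberately imbalanced toy set. This is a sketch under stated assumptions: the 10% positive rate and the val_positive_rates helper are made up for illustration and do not come from the example above. It compares the validation positive rate across 20 seeds with and without stratify=.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels (10% positives) so the effect is easy to see.
labels = np.array([1] * 15 + [0] * 135)
rows = np.arange(labels.size)


def val_positive_rates(stratify):
    """Hypothetical helper: positive rate of the val split for 20 seeds."""
    rates = []
    for seed in range(20):
        _, val_rows = train_test_split(
            rows,
            stratify=labels if stratify else None,
            test_size=0.20,
            random_state=seed,
        )
        rates.append(labels[val_rows].mean())
    return np.array(rates)


stratified = val_positive_rates(stratify=True)
unstratified = val_positive_rates(stratify=False)

# Stratifying keeps the positive rate near 10% for every seed;
# without it the rate drifts from seed to seed.
print(stratified.std(), unstratified.std())
```

The seed still changes which rows land in val, so cross-seed variance is preserved; stratification only removes the part of that variance that comes from shifting class balance.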
See also
SourceDisjointKFoldSplitter (API)
PurgedKFoldSplitter for time-aware LODO with embargo
methodology/splits.md for the fold-design rationale + when to use each splitter