---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
kernelspec:
  display_name: Python 3
  language: python
  name: python3
---

# Worked example: nested seed-split composition over `SourceDisjointKFoldSplitter`

> **What this shows.** The canonical OOD-CV protocol per Bayle 2020 +
> matched-budget per HELM lineage: LODO k-fold splits × multiple seeds ×
> within-fold stratified train/val. Each toolkit primitive does one thing;
> the composition is project glue.
>
> **Runtime:** ~1 s. Pure-numpy/sklearn core; no optional deps.

## Why this composition

`SourceDisjointKFoldSplitter` is *single-seed*: its `seed=` only shuffles the
source-bucket order, not the per-row split. For multi-seed protocols (standard
for variance estimation across re-runs), loop seeds *outside* the splitter and
apply stratified train/val per seed. Each fold's train partition then gets
sub-split into train+val with the per-seed
`sklearn.model_selection.train_test_split`.

This was a v0.34.0-era cookbook addition (closes #19). The pattern is **not**
invented by the toolkit; the splitter API is intentionally minimal and composes
with stdlib + sklearn for the loops that wrap it.

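
The seed behaviour described above can be illustrated without the toolkit. The following is a minimal sketch, not the splitter's actual implementation: `leave_one_source_out` and `partition_set` are hypothetical helpers, and the sketch assumes `k` equals the number of sources, so each fold holds out exactly one source. Under that assumption, changing the seed reorders the folds but leaves the set of row partitions unchanged.

```python
import numpy as np


def leave_one_source_out(sources, seed):
    """Hypothetical stand-in for a source-disjoint splitter with k == n_sources.

    Yields (held_out_source, test_row_indices). The seed only permutes the
    order in which sources are held out; it never moves rows between sources.
    """
    sources = np.asarray(sources)
    rng = np.random.default_rng(seed)
    unique_sources = np.unique(sources)
    for src in rng.permutation(unique_sources):
        yield str(src), np.flatnonzero(sources == src)


def partition_set(sources, seed):
    """Collect the folds as an order-free set of (source, row indices)."""
    return {
        (src, tuple(idx.tolist()))
        for src, idx in leave_one_source_out(sources, seed)
    }


sources = np.repeat(["src_a", "src_b", "src_c", "src_d"], 50)

# Fold order may differ between seeds, but the partitions themselves do not.
reference = partition_set(sources, 42)
same_partitions = all(partition_set(sources, s) == reference for s in (43, 44))
print(same_partitions)  # True
```

Because the fold partitions are identical across seeds in this setting, re-running with different splitter seeds adds no new information about variance; the per-seed train/val split is the only source of it.
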
## Setup

```{code-cell}
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

from eval_toolkit import EvalSlice, SourceDisjointKFoldSplitter, set_global_seeds

set_global_seeds(42)
```

## Synthetic data: 4 sources, roughly balanced labels

```{code-cell}
rng = np.random.default_rng(42)
n_per_source = 50
df = pd.DataFrame({
    "text": [f"row_{i}" for i in range(4 * n_per_source)],
    "label": rng.integers(0, 2, size=4 * n_per_source),
    "source": np.repeat(["src_a", "src_b", "src_c", "src_d"], n_per_source),
})
parent = EvalSlice(df=df, name="full_corpus")
```

## LODO k-fold × multi-seed × stratified train/val

```{code-cell}
splitter = SourceDisjointKFoldSplitter(source_col="source", k=4, seed=42)
seeds = (42, 43, 44)

split_count = 0
for fold_idx, fold_dict in enumerate(splitter.iter_folds(parent)):
    train_full = fold_dict["train"]  # 3-source train partition (EvalSlice)
    test_slice = fold_dict["test"]   # held-out 1-source test partition
    for seed in seeds:
        # Project glue: stratified train/val split per seed.
        # The splitter doesn't own this; sklearn does, and that's intentional.
        train_idx, val_idx = train_test_split(
            np.arange(len(train_full.df)),
            stratify=train_full.df["label"],
            test_size=0.20,
            random_state=seed,
        )
        # The indices are positional, so materialize the frames with .iloc.
        train_df = train_full.df.iloc[train_idx]
        val_df = train_full.df.iloc[val_idx]
        # 4 folds × 3 seeds × {train, val, test} = 36 splits
        split_count += 3

assert split_count == 4 * 3 * 3  # 36 splits total
```

## Common pitfalls

- **Don't vary the splitter's `seed=` for variance estimation.** The splitter
  seed only shuffles source-bucket order; it does NOT produce different per-row
  partitions. Each fold's row membership is deterministic given
  `(source_col, k)`. Variance comes from the per-seed *train/val* split, not
  from the fold partition.
- **Stratify on label, not source.** Source is already partitioned at fold
  level; stratifying on source within the train partition would work against
  the LODO design intent.

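
To make the pitfalls concrete, here is a minimal sketch using only numpy and scikit-learn. The `labels` array is a synthetic stand-in for one fold's 3-source train partition (150 rows), not data from the toolkit. It checks two things: different seeds yield different validation rows, and stratifying on the label keeps the validation class counts proportional to the full partition.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for one fold's train partition: 150 binary labels.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=150)
row_ids = np.arange(len(labels))

val_sets = {}
val_positive_counts = {}
for seed in (42, 43, 44):
    train_idx, val_idx = train_test_split(
        row_ids,
        stratify=labels,
        test_size=0.20,
        random_state=seed,
    )
    # Train and val never share a row within a single split.
    assert set(train_idx.tolist()).isdisjoint(val_idx.tolist())
    val_sets[seed] = set(val_idx.tolist())
    val_positive_counts[seed] = int(labels[val_idx].sum())

# Expected number of positives in a 20% stratified sample of this partition.
expected_val_positives = 0.20 * int(labels.sum())
print(val_positive_counts, expected_val_positives)
```

The validation membership changes from seed to seed, which is where run-to-run variance enters, while the class balance stays essentially fixed because of `stratify=labels`.
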
## See also

- `SourceDisjointKFoldSplitter` ([API](../api/splits.md))
- `PurgedKFoldSplitter` for time-aware LODO with embargo
- [methodology/splits.md](../methodology/splits.md) for the fold-design
  rationale + when to use each splitter