# Splits

> **Background** *(skip if you've internalized this)*. A *split* partitions
> your dataset into train and one or more eval sets. The choice of how to
> split is methodological: random partitioning measures interpolation
> across your sample; group-disjoint partitioning measures generalization
> across users / patients / sources; time-aware partitioning measures
> forecasting. The wrong split silently inflates metrics — a model can
> ace random K-fold and still flop in production because the production
> distribution sees groups / time periods / sources the training set
> never saw.

This chapter covers when to use each [`Splitter`](../api/splits.md)
reference implementation, when CV alone is insufficient, and how to
compose splits with leakage checks.

## Setup

```python
import numpy as np
import pandas as pd
from eval_toolkit import EvalSlice
```

A 40-row mixed fixture used throughout:

```python
df = pd.DataFrame({
    "text":   [f"row_{i}" for i in range(40)],
    "label":  [i % 2 for i in range(40)],
    "group":  [i // 4 for i in range(40)],   # 10 groups of 4 rows each
    "source": [f"src_{i % 5}" for i in range(40)],  # 5 sources
    "t":      np.arange(40),                  # monotone timestamps
})
parent = EvalSlice(name="all", df=df)
```

(holdout-vs-kfold)=
## Holdout vs K-fold
**Holdout.** A single train/test partition. Cheap, single number, easy to
report. Variance is high — a different random seed gives a different
metric. Suitable for a *final* evaluation against a frozen test set.

**K-fold cross-validation.** K train/test partitions; report mean ± CI of
the per-fold metric. Lower variance, better use of small data. Suitable
for *development* (model selection, hyperparameter tuning,
methodology choices).
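
As a rough illustration of "mean ± CI", the sketch below computes a normal-approximation interval from a handful of per-fold scores. The scores are made up for the example, `fold_mean_ci` is a hypothetical helper rather than part of `eval_toolkit`, and the 1.96 multiplier assumes a 95 % normal interval. Fold scores share training data, so treat the interval as approximate.

```python
import math
import statistics

def fold_mean_ci(scores, z=1.96):
    """Return (mean, lower, upper) for per-fold scores using mean ± z * SE."""
    k = len(scores)
    if k < 2:
        raise ValueError("need at least two folds to estimate spread")
    mean = statistics.fmean(scores)
    se = statistics.stdev(scores) / math.sqrt(k)  # sample std (ddof=1) over folds
    return mean, mean - z * se, mean + z * se

# Illustrative per-fold metric values (not real results).
scores = [0.81, 0.78, 0.84, 0.80, 0.77]
mean, lo, hi = fold_mean_ci(scores)
print(f"mean={mean:.3f}  95% CI=({lo:.3f}, {hi:.3f})")
```

With only a few folds, a Student-t multiplier would give a wider and arguably more honest interval than 1.96.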

**Don't conflate the two.** Use K-fold to *develop*, then a separately
held-out test set to *report*. A K-fold mean that guided model selection
is optimistically biased, so reporting it as if it were a holdout result
leaks model-selection information into the metric.

```python
from eval_toolkit import HoldoutSplitter, StratifiedKFoldSplitter

holdout = HoldoutSplitter(test_size=0.25, seed=42)
folds = list(holdout.iter_folds(parent))
print(f"holdout: {len(folds)} fold(s); train={len(folds[0]['train'].df)} test={len(folds[0]['test'].df)}")

cv = StratifiedKFoldSplitter(k=5, seed=42)
print(f"5-fold: {cv.get_n_splits(parent)} folds")
```

Both Splitters yield the same shape: `dict[str, EvalSlice]` keyed by
`"train"` / `"test"`. Holdout is just K=1, so consumer code can treat
both identically:

```python
for splitter in [HoldoutSplitter(seed=42), StratifiedKFoldSplitter(k=3, seed=42)]:
    for i, fold in enumerate(splitter.iter_folds(parent)):
        print(f"  {type(splitter).__name__} fold {i}: "
              f"train={len(fold['train'].df)} test={len(fold['test'].df)}")
```

(stratified-kfold)=
## Stratified K-fold
**When to use.** Default for binary classification with class imbalance.
Keeps the positive/negative ratio stable across folds, which keeps PR-AUC
estimates comparable across folds.

**Primitive.**
[`StratifiedKFoldSplitter`](../api/splits.md) wraps
`sklearn.model_selection.StratifiedKFold`.

```python
imbalanced = pd.DataFrame({
    "text":  [f"row_{i}" for i in range(50)],
    "label": [1 if i < 5 else 0 for i in range(50)],  # 10 % positive
})
imbal_slice = EvalSlice(name="imbal", df=imbalanced)

cv = StratifiedKFoldSplitter(k=5, seed=42)
for fold in cv.iter_folds(imbal_slice):
    n_pos = int(fold["test"].y_true.sum())
    print(f"  test fold: n={len(fold['test'].df)} pos={n_pos}")
```
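
To see why stratification matters for this fixture, the sketch below cuts the same label layout into plain contiguous folds without stratifying. It uses plain Python rather than any `eval_toolkit` or sklearn splitter, so it only illustrates the failure mode and is not a description of either library.

```python
# Same label layout as the imbalanced fixture: 5 positives, then 45 negatives.
labels = [1 if i < 5 else 0 for i in range(50)]
k = 5
fold_size = len(labels) // k  # 10 rows per fold

# Unstratified, unshuffled K-fold: each test fold is one contiguous block.
pos_per_fold = [
    sum(labels[f * fold_size:(f + 1) * fold_size]) for f in range(k)
]
print(pos_per_fold)  # → [5, 0, 0, 0, 0]

# Folds with no positives cannot produce a meaningful PR-AUC.
empty_folds = [f for f, n_pos in enumerate(pos_per_fold) if n_pos == 0]
print(f"folds with no positives: {empty_folds}")
```

Shuffling before splitting reduces the chance of an empty fold but does not remove it; stratifying does.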

> **What NOT to do.** Don't pass an unstratified `KFold` for binary
> classification with rare positives. A fold with zero positives breaks
> PR-AUC entirely (single-class slices return `np.nan`).

(group-kfold)=
## Group-disjoint K-fold
**When to use.** Whenever rows cluster by an identifier with strong
within-cluster correlation: same patient, same user, same document
across paragraphs, same author, same source. Random K-fold leaks across
groups; group-disjoint K-fold doesn't.

**Primitive.**
[`GroupKFoldSplitter`](../api/splits.md) wraps
`sklearn.model_selection.GroupKFold`. Pass `group_col=` (column name in
the slice's dataframe) or `groups=` (numpy array at `iter_folds` call).

```python
from eval_toolkit import GroupKFoldSplitter

splitter = GroupKFoldSplitter(k=5, group_col="group")
for fold in splitter.iter_folds(parent):
    train_groups = set(fold["train"].df["group"].tolist())
    test_groups  = set(fold["test"].df["group"].tolist())
    assert not (train_groups & test_groups)
print("group-disjoint K-fold OK")
```
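
The leak that group-disjoint folds prevent can be counted directly. The sketch below uses the fixture's group layout and compares a group-blind interleaved assignment against a whole-group assignment. Both assignment rules and the `leaked_groups` helper are illustrative assumptions, not how `GroupKFoldSplitter` or sklearn picks folds.

```python
# Same layout as the fixture: 40 rows, 10 groups of 4 consecutive rows.
groups = [i // 4 for i in range(40)]
k = 5

def leaked_groups(test_rows):
    """Groups that have rows on both sides of a train/test split."""
    test_rows = set(test_rows)
    test_g = {groups[i] for i in test_rows}
    train_g = {groups[i] for i in range(len(groups)) if i not in test_rows}
    return test_g & train_g

# Group-blind folds: row i goes to test fold i % k.
blind = [
    len(leaked_groups([i for i in range(40) if i % k == f])) for f in range(k)
]

# Group-aware folds: every row of a group goes to test fold (group % k).
aware = [
    len(leaked_groups([i for i in range(40) if groups[i] % k == f])) for f in range(k)
]

print("group-blind leaked groups per fold:", blind)
print("group-aware leaked groups per fold:", aware)
```

Any non-zero count in the first list means the model is scored on groups it has already seen during training.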

Pair with [`GroupLeakageCheck`](../api/leakage.md) (see
[leakage.md §6](leakage.md#group-leakage)) to catch the case where the
underlying dataset already has the same group ID in pre-existing splits.

(source-disjoint-kfold)=
## Source-disjoint K-fold (NEW in v0.7.0)
**When to use.** Stronger than group-disjoint when you want the test
fold's sources to **never appear in any training fold across the whole
CV procedure**, not just within a single fold. This is the pattern that
`prompt-injection-sdd` hand-rolled (3-fold, 3 seeds, 9 runs total).

**Primitive.**
[`SourceDisjointKFoldSplitter`](../api/splits.md)
generalizes the pattern. Distinct sources are sorted, shuffled with the
seed, then round-robin assigned to K folds. Fold *i*'s test set = rows
whose source falls in bucket *i*.

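**Difference from `GroupKFoldSplitter`.** Both keep train and test
disjoint on the grouping column within each fold. What the
source-disjoint splitter adds is an explicit, seed-controlled
assignment: distinct sources are dealt round-robin into K buckets, so
the *union* of test folds covers all sources, each source appears in
exactly one test fold across the procedure, and changing the seed
changes which sources are held out together.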

```python
from eval_toolkit import SourceDisjointKFoldSplitter

splitter = SourceDisjointKFoldSplitter(source_col="source", k=5, seed=42)
for i, fold in enumerate(splitter.iter_folds(parent)):
    test_sources = sorted(set(fold["test"].df["source"].tolist()))
    print(f"  fold {i} test sources: {test_sources}")
```
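
The sort, shuffle, round-robin assignment described above can be sketched in plain Python. This is a minimal illustration under stated assumptions: `assign_sources_round_robin` is a hypothetical helper, the train fold is taken to be the complement of the test fold, and the standard-library RNG is used, so the bucket contents will not necessarily match what `SourceDisjointKFoldSplitter` produces for the same seed.

```python
import random

def assign_sources_round_robin(sources, k, seed):
    """Map each distinct source to a fold index in [0, k)."""
    distinct = sorted(set(sources))        # sort so the result ignores row order
    random.Random(seed).shuffle(distinct)  # seeded, reproducible shuffle
    return {src: i % k for i, src in enumerate(distinct)}

sources = [f"src_{i % 5}" for i in range(40)]  # same layout as the fixture
bucket = assign_sources_round_robin(sources, k=5, seed=42)

for fold in range(5):
    test_rows = [i for i, s in enumerate(sources) if bucket[s] == fold]
    train_rows = [i for i, s in enumerate(sources) if bucket[s] != fold]
    test_src = {sources[i] for i in test_rows}
    train_src = {sources[i] for i in train_rows}
    assert not (test_src & train_src)  # no source on both sides of this fold
    print(f"fold {fold}: test sources={sorted(test_src)} n_test={len(test_rows)}")
```

Sorting before shuffling is what makes the assignment depend only on the seed and the set of sources, not on the order rows happen to arrive in.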

For multi-seed × CV (the prompt-injection-sdd 9-run pattern), wrap the
splitter in a seed loop in your harness call:

```python
from eval_toolkit.harness import evaluate_folded
# evaluate_folded(scorers, splitter, slice_, run_id="r", seeds=(1, 2, 3), ...)
```

(time-series-splits)=
## Time-aware splits
**When to use.** Whenever the data has a temporal dimension and the
production deployment will see *future* timestamps not in the training
set. Random K-fold mixes time periods and lets the model interpolate
across them; time-aware splits force the model to extrapolate.

**Primitive.**
[`TimeSeriesSplitter`](../api/splits.md) wraps
`sklearn.model_selection.TimeSeriesSplit`. Each fold's train set is
everything ≤ a moving boundary; the test set is the next chunk after.

```python
from eval_toolkit import TimeSeriesSplitter

splitter = TimeSeriesSplitter(k=4, time_col="t")
for i, fold in enumerate(splitter.iter_folds(parent)):
    max_train = fold["train"].df["t"].max()
    min_test  = fold["test"].df["t"].min()
    assert max_train < min_test
    print(f"  fold {i}: train≤{max_train}  test≥{min_test}")
```
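
The moving-boundary idea can be written out as index arithmetic. The sketch below is modeled on the default behaviour of sklearn's `TimeSeriesSplit` (no gap, no cap on training size); treat the exact boundaries as an assumption rather than a specification of `TimeSeriesSplitter`. It also assumes rows are already sorted by time, and `expanding_window_folds` is a hypothetical helper.

```python
def expanding_window_folds(n_rows, k):
    """Yield (train_idx, test_idx) pairs where train always precedes test."""
    test_size = n_rows // (k + 1)
    if test_size == 0:
        raise ValueError("too few rows for the requested number of folds")
    first_test = n_rows - k * test_size
    for start in range(first_test, n_rows, test_size):
        # Train on everything before the boundary, test on the next chunk.
        yield list(range(start)), list(range(start, start + test_size))

for i, (train_idx, test_idx) in enumerate(expanding_window_folds(40, k=4)):
    assert max(train_idx) < min(test_idx)
    print(f"fold {i}: train=0..{train_idx[-1]}  test={test_idx[0]}..{test_idx[-1]}")
```

The training window grows with each fold, so later folds are trained on more data than earlier ones; per-fold metrics are therefore not identically distributed.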

Pair with
[`TemporalLeakageCheck`](../api/leakage.md) (see
[leakage.md §7](leakage.md#temporal-leakage)) to verify the invariant
end-to-end.

(cv-and-ood)=
## When CV alone is insufficient
K-fold CV with any random partitioning estimates *interpolation across
your sample*, not generalization to a new population. For
"out-of-distribution" claims (the model works on data shaped like the
production stream, not just like dev) you need **two** layers:

1. **Source-disjoint CV during development** — measures sensitivity to
   domain shift across your *known* axes (source, time, group).
2. **A separate locked final-holdout test set**, *never used* in
   development, drawn from a distribution as close to production as you
   can get.

The CV mean is your "expected dev-time performance under domain shift";
the holdout is your "honest single-shot release metric". Confusing the
two overstates how well the model will generalize: see Recht et al.
2019 for how much accuracy can drop on a freshly collected test set, and
the [Hidden Leaks in Time Series](https://arxiv.org/html/2512.06932v1)
work on validation-strategy leakage.
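
One way to picture the two layers is to carve off the locked holdout first and run source-disjoint CV only on what remains. The sketch below does this on row indices in plain Python. Holding out an entire source is just one possible choice of holdout, and `lock_holdout_then_cv` is a hypothetical helper, not an `eval_toolkit` function.

```python
import random

def lock_holdout_then_cv(sources, holdout_sources, k, seed):
    """Return (holdout_rows, cv_folds); the CV folds never touch holdout rows."""
    holdout_sources = set(holdout_sources)
    holdout_rows = [i for i, s in enumerate(sources) if s in holdout_sources]
    dev_rows = [i for i, s in enumerate(sources) if s not in holdout_sources]

    # Source-disjoint CV over the development pool only.
    dev_sources = sorted({sources[i] for i in dev_rows})
    random.Random(seed).shuffle(dev_sources)
    bucket = {s: j % k for j, s in enumerate(dev_sources)}

    folds = []
    for f in range(k):
        test = [i for i in dev_rows if bucket[sources[i]] == f]
        train = [i for i in dev_rows if bucket[sources[i]] != f]
        folds.append((train, test))
    return holdout_rows, folds

sources = [f"src_{i % 5}" for i in range(40)]  # same layout as the fixture
holdout_rows, folds = lock_holdout_then_cv(sources, {"src_4"}, k=4, seed=0)

print(f"locked holdout rows: {len(holdout_rows)}")
for f, (train, test) in enumerate(folds):
    assert set(holdout_rows).isdisjoint(train + test)
    print(f"dev fold {f}: train={len(train)} test={len(test)}")
```

The holdout rows are computed once and never passed to anything in the development loop, which is the property that keeps the release metric honest.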

(splits-pitfalls)=
## Pitfalls / Common mistakes
- **Tuning hyperparameters on the test fold.** If your fold loop is
  outermost and your inner loop tunes hyperparameters using the test
  fold, your "K-fold mean" is a test-fold-leaky number. Use a *nested*
  CV (inner CV on the train fold for HP search, outer CV for reporting)
  or a separate validation slice carved off the train fold.
- **Reporting K-fold mean without CI.** Use
  [`cv_clt_ci`](../api/bootstrap.md) (or call
  `evaluate_folded(...)` which auto-computes it). A single mean number
  hides the cross-fold variance and is impossible to compare across
  papers.
- **Using `GroupKFold` when you actually want
  `SourceDisjointKFoldSplitter`.** The two have meaningfully different
  guarantees; see
  [Source-disjoint K-fold](#source-disjoint-kfold) above. If your
  concern is "test sources never seen in train", you want
  source-disjoint.
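- **`shuffle=False` in `StratifiedKFold` with sorted-by-time data.**
  Without shuffling, each test fold is a contiguous block of rows per
  class, so the folds line up with time periods and most of them train
  on rows *later* than their test rows. That resembles a time-aware
  split but leaks the future into training (probably not what you
  intended).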
- **Splitting before deduplication.** If your dataset has duplicates and
  you split first, the duplicates might end up on opposite sides of the
  split — instant cross-split leakage. Run
  [`ExactDuplicateCheck`](../api/leakage.md) +
  [`NearDuplicateCheck`](../api/leakage.md) on the *full*
  dataset before splitting.
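
The nested-CV fix from the first pitfall comes down to one invariant: nothing in the inner loop may touch the outer test fold. The sketch below checks that invariant on row indices only. `interleaved_folds` is a hypothetical helper standing in for whichever splitter you use, and the model fitting is left as comments.

```python
def interleaved_folds(rows, k):
    """Return k (train, test) pairs; the row at position p is tested in fold p % k."""
    folds = []
    for f in range(k):
        test = rows[f::k]
        held = set(test)
        folds.append(([r for r in rows if r not in held], test))
    return folds

rows = list(range(40))
inner_checks = 0
for outer_train, outer_test in interleaved_folds(rows, k=5):
    outer_held = set(outer_test)
    # Inner loop: hyperparameter search sees outer-train rows only.
    for inner_train, inner_val in interleaved_folds(outer_train, k=3):
        assert outer_held.isdisjoint(inner_train)
        assert outer_held.isdisjoint(inner_val)
        inner_checks += 1
        # ... fit each candidate config on inner_train, score it on inner_val ...
    # ... refit the winning config on outer_train, score once on outer_test ...

print(f"{inner_checks} inner folds, none touching an outer test fold")
```

Only the outer-fold scores go into the reported mean; the inner scores exist solely to pick a configuration.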

## Further reading

- Hastie, T., Tibshirani, R., & Friedman, J. *The Elements of Statistical
  Learning.* §7.10 (cross-validation methodology).
- Yan, X. et al. *Hidden Leaks in Time Series Forecasting.* arXiv 2025.
  [arXiv:2512.06932](https://arxiv.org/html/2512.06932v1) — formalizes
  validation-strategy leakage.
- Recht, B. et al. *Do ImageNet classifiers generalize to ImageNet?*
  ICML 2019. Measures how much accuracy drops when ImageNet classifiers
  are re-evaluated on a freshly collected test set.
- sklearn docs:
  [`StratifiedKFold`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html),
  [`GroupKFold`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GroupKFold.html),
  [`TimeSeriesSplit`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html).

See also: [leakage.md](leakage.md),
[comparison.md](comparison.md) (CV CI computation).