Parallelism#
Background (skip if you’ve internalized this). Several toolkit primitives — especially the bootstrap family — spend the bulk of their wall-clock time in Python-level resample loops. For high
n_resamples(5 000+) on cheap metrics or anyn_resampleson expensive metrics (operating-point refit, ECE recomputation), parallel execution across CPU cores gives 3–10× speedup.
This chapter codifies eval-toolkit’s unified parallelism design — six
principles + one internal helper that every parallel-capable public API
calls into. The goal: contributors adding n_jobs to a new function
inherit a consistent backend, reproducibility contract, and error-surface
shape without re-inventing them.
Design principles (the six)#
1. Single backend#
All parallelism in the toolkit uses joblib with the loky backend.
Reasons:
joblibis already transitive via sklearn — no new dependency.lokyis process-based (sidesteps the GIL for CPU-bound metric calls).Helpful pickling-error messages save user time vs raw
multiprocessing.Matches the sklearn-eval ecosystem convention; callers’ mental model carries over.
Toolkit does not use concurrent.futures, raw multiprocessing,
threading, or asyncio for CPU-bound parallelism.
2. Single helper#
The single source of truth is
eval_toolkit._parallel.parallel_map() (internal — not exported in
__all__). Every public function that accepts n_jobs calls into
this helper rather than inlining its own joblib.Parallel invocation.
Why centralised:
One place to update if joblib’s API shifts
Consistent error messages across the toolkit
Smart-default semantics (auto-cap, guidance log, picklability sniff) apply uniformly
3. Opt-in per function#
n_jobs is an explicit keyword-only parameter on each
parallel-capable public function, defaulting to 1 (sequential).
def paired_bootstrap_diff(..., *, n_jobs: int = 1) -> PairedBootstrapCI: ...
No global config (no EVAL_TOOLKIT_N_JOBS env var, no
joblib.parallel_backend context-manager support). Reasoning:
Tests stay deterministic in wall-clock (no environment-dependent timing)
Callsite is auditable — readers can see at a glance whether a call is parallel
Matches Python stdlib convention (“explicit opt-in” for parallelism)
4. Default sequential#
n_jobs=1 (default) runs a pure-Python for-loop, preserving:
Traceback fidelity (no joblib worker-process indirection in stack traces)
Reproducibility (no race conditions; deterministic execution order)
Zero startup overhead (no loky pool spin-up)
For n_jobs == 1 AND len(items) >= 1000, the helper emits a one-
shot INFO log per Python process suggesting the caller might want
n_jobs > 1. This is the only INFO-level log site in the toolkit
(everything else is DEBUG); the threshold + once-per-process semantics
keep the noise floor near-zero for typical workloads.
5. Reproducibility contract#
When a parallel-capable function performs random resampling, it MUST use
numpy.random.SeedSequence.spawn() to derive per-item RNG seeds so
n_jobs > 1 produces bit-for-bit-identical output to n_jobs == 1
for the same caller-supplied seed:
seed_seq = np.random.SeedSequence(seed)
spawned = seed_seq.spawn(n_resamples)
deltas = parallel_map(_resample_step, spawned, n_jobs=n_jobs)
This is the invariant that callers rely on when they tune n_jobs
for speed. Every parallel-capable function in the toolkit has a
test_*_n_jobs_reproducibility test asserting this equivalence.
6. Picklability surface#
The helper does a pickle.dumps(fn) sniff-test up front and raises a
helpful TypeError when fn is unpicklable (lambdas, closures over
local state):
TypeError: parallel_map of paired bootstrap: callable is not picklable
(lambdas and closures over local state are not supported with n_jobs != 1).
Define a named top-level function. Underlying error: ...
This avoids the cryptic _pickle.PicklingError deep in a joblib worker
traceback.
Smart defaults summary#
|
Behavior |
|---|---|
|
Sequential pure-Python loop. INFO guidance log if |
|
Rejected — |
|
joblib convention: all available cores. Not capped. |
|
joblib loky with |
|
Capped with WARNING log; prevents CPU-frying when caller passes |
When to add n_jobs to a new function#
Add the kwarg when all of these are true:
The function has a Python-level loop over independent work units
Per-item cost is medium or higher (>10ms typical)
Typical iteration count is ≥1000 OR per-item cost is expensive enough that smaller counts still win (e.g., scorer.predict_proba per slice)
The loop body is picklable (no lambdas; doesn’t close over thread-local state)
Then:
Refactor the loop body into a module-level
_step_fnhelperUse
np.random.SeedSequence(seed).spawn(n)to seed each step (if random)Replace the loop with
parallel_map(_step_fn, items, n_jobs=n_jobs, description="...")Add a
test_*_n_jobs_reproducibilitytest asserting equivalenceUpdate the function’s docstring
Parameterssection with the standardn_jobsblurb (see existing bootstrap-fn docstrings as the template)
Currently parallel-capable functions#
As of v0.34.0:
Function |
n_jobs threshold |
Notes |
|---|---|---|
|
|
All resamples are independent |
|
|
Original issue #17 ask |
|
|
|
|
|
Per-resample threshold refit benefits most from parallelism |
|
Pass-through to internal |
Not yet parallelised (filed as follow-up issues):
harness._score_all_slicesandharness.evaluate_folded(slice × scorer loop) — Scorer picklability is the gatekeeper; see Scorer picklability below.harness._attach_transferred_operating_points(OperatingPointSpec loops) — same Scorer gatekeeper.text_dedup.MinHashLSHStrategyhash-bucket construction — race- condition design needed for per-worker accumulation.
Scorer picklability#
When the harness parallelization issues
(#29,
#30) land, the
work-unit dispatched to each loky worker bundles both a step function AND a
Scorer instance. joblib pickles the
entire delayed call — function plus bound arguments — so an unpicklable
Scorer fails at dispatch even when the step function itself is fine.
The existing parallel_map sniff (Principle #6 above) covers the function;
this section establishes the parallel contract for the Scorer surface.
Rule. Any Scorer passed to a parallel-capable harness call
(evaluate(..., n_jobs > 1) once #29/#30 land) MUST be picklable. The
toolkit’s existing helper raises a clean TypeError with the underlying
pickle error attached — no special exception subclass.
Picklable (works — top-level class, picklable state):
class ThresholdScorer:
def __init__(self, threshold: float) -> None:
self.threshold = threshold
def predict_proba(self, X):
# bound numpy / torch state is fine — both pickle cleanly
...
Not picklable (raises TypeError at dispatch):
def make_scorer(threshold):
def predict_proba(X):
# closure over `threshold` — joblib cannot pickle the inner fn
...
return predict_proba # ← returned closure is not picklable
Fix. Promote the closure to a top-level class (as in the picklable example above). Bound instance attributes pickle naturally; closures over local state do not.
Common non-picklable cases to watch for in user-supplied Scorers:
Closures (above) — most common
lambdaexpressions assigned to instance attributesLocal-scope class definitions (defined inside a function or method)
Attributes holding open file handles, sockets, or live model-server connections — re-establish these inside
predict_probainstead of caching onself
For deeper joblib pickling guidance see the joblib docs (link below).
Memory model: worker-copy semantics#
When n_jobs != 1, joblib’s loky backend forks a worker pool and
each worker receives a full copy of every argument bound at
delayed(fn)(arg) call time. There is no shared-memory channel for
caller objects — loky’s worker-to-parent IPC is pickle-based, so the
parent’s memory is duplicated, not aliased.
The implication for memory-bounded parallelism:
A
pd.DataFrameof size N MB held in each work item’s spec producesn_jobs × NMB of resident memory across the worker pool. On a 128-core / 247 GB machine, ann_jobs=-1sweep where each cell carries a 30 GB working set (BCa bootstrap intermediates on a multi-million-row corpus) projects to ~3.8 TB peak — the OOM killer fires before any cell finishes. This is not a joblib bug; it’s the cost of the picklability contract (Principle #6).NumPy arrays are subject to the same rule but loky transparently uses joblib’s memmap-based shared-memory fast path for arrays above a threshold (~1 MB by default). This means small NumPy arrays duplicate; large NumPy arrays auto-share via
/dev/shm. No such fast path exists for DataFrames — pandas objects always duplicate.Module-level / process-global state is NOT copied per call. Workers inherit the parent’s globals at fork time, then operate independently. State written to a module-level singleton inside one worker is invisible to others.
Recommended ceiling for DataFrame-bearing workloads#
When in doubt, cap n_jobs at min(8, available_RAM_GB / spec_size_GB).
For the 30 GB / cell case above, a 247 GB machine supports
min(8, 247/30) ≈ 8 workers — far below the 128 cores available, but
the limit is RAM, not compute. The parallel_map helper does not
auto-detect this; it’s a caller responsibility to size n_jobs
against memory headroom.
See also#
eval_toolkit.bootstrap— primary callersdocs/source/methodology/bootstrap.md— algorithmic background on bootstrap resamplingjoblibdocumentation — https://joblib.readthedocs.io/en/latest/parallel.html