eval_toolkit.harness

DEFAULT_BOOTSTRAP_RESAMPLES

Integer constant: the default number of bootstrap resamples.

RUN_RESULT_SCHEMA_VERSION

String constant: the schema version of a serialized RunResult.

EvalSlice

A single eval slice (dev test, OOD slice, ablation slice, etc.).

RunResult

Outcome of a full evaluation run.

evaluate

Run every scorer on every slice; return a pure RunResult (no IO).
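The scorer-by-slice loop described above can be sketched as follows. This is only an illustration of the shape of a pure (no IO) evaluation pass, not the module's implementation: `SliceSketch`, `evaluate_sketch`, and `accuracy_at_half` are hypothetical names, and the real `EvalSlice` and `RunResult` fields are not documented here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


# Hypothetical stand-in for EvalSlice; the real fields are not shown in this reference.
@dataclass(frozen=True)
class SliceSketch:
    name: str
    labels: Sequence[int]
    inputs: Sequence[float]


Scorer = Callable[[Sequence[float]], List[float]]
Metric = Callable[[Sequence[int], Sequence[float]], float]


def accuracy_at_half(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Toy metric: fraction of examples where (score >= 0.5) matches the label."""
    hits = sum(int(score >= 0.5) == label for label, score in zip(labels, scores))
    return hits / len(labels)


def evaluate_sketch(
    scorers: Dict[str, Scorer],
    slices: Sequence[SliceSketch],
    metric: Metric,
) -> Dict[str, Dict[str, float]]:
    """Return {scorer_name: {slice_name: metric_value}} without reading or writing files."""
    results: Dict[str, Dict[str, float]] = {}
    for scorer_name, scorer in scorers.items():
        per_slice: Dict[str, float] = {}
        for eval_slice in slices:
            scores = scorer(eval_slice.inputs)
            per_slice[eval_slice.name] = metric(eval_slice.labels, scores)
        results[scorer_name] = per_slice
    return results
```

Keeping this step free of IO means the same result can be inspected, compared, or serialized later without re-running the scorers.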

evaluate_folded

Run a fold aggregator over every Splitter fold × seed combination; return a RunResult with a CV-CI summary.

evaluate_scorer_on_slice

Score one scorer on one slice; return headline + bootstrap CI on PR-AUC.
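For readers unfamiliar with the technique, a percentile bootstrap CI on PR-AUC might look like the sketch below. It assumes PR-AUC is estimated as average precision and uses only the standard library. The function names, the default of 200 resamples, and the choice to skip resamples with no positives are all assumptions made for illustration, not the behavior of `evaluate_scorer_on_slice`.

```python
import random
from typing import List, Sequence, Tuple


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise average precision: mean of precision@rank over the positive examples."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positives")
    # Rank examples by descending score (ties keep their original order).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


def bootstrap_ci(
    labels: Sequence[int],
    scores: Sequence[float],
    n_resamples: int = 200,  # assumed default, not DEFAULT_BOOTSTRAP_RESAMPLES
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (point_estimate, ci_low, ci_high) using the percentile bootstrap."""
    rng = random.Random(seed)
    n = len(labels)
    point = average_precision(labels, scores)  # raises if there are no positives
    stats: List[float] = []
    while len(stats) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        ys = [labels[i] for i in idx]
        if sum(ys) == 0:
            continue  # assumption: redraw resamples that contain no positives
        stats.append(average_precision(ys, [scores[i] for i in idx]))
    stats.sort()
    low = stats[int((alpha / 2) * (n_resamples - 1))]
    high = stats[int((1 - alpha / 2) * (n_resamples - 1))]
    return point, low, high
```

Fixing the seed makes the interval reproducible across runs, which matters when intervals from different scorers are compared.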

with_claim_report

Return a copy of result with a serialized claim report attached.

write_run_result

Write a RunResult to run_dir as two JSON files (compact + full).
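The compact-plus-full layout could be produced along these lines. This is a minimal sketch under stated assumptions: the file names `result.compact.json` and `result.full.json`, the `per_example` key that the compact file omits, and the use of a plain dict in place of a RunResult are all hypothetical, since the actual file names and schema are not given here.

```python
import json
from pathlib import Path
from typing import Any, Dict


def write_two_files_sketch(result: Dict[str, Any], run_dir: Path) -> Dict[str, Path]:
    """Write a summary-only JSON file and a complete JSON file into run_dir."""
    run_dir.mkdir(parents=True, exist_ok=True)

    # Hypothetical rule: the compact file drops the bulky per-example payload.
    compact = {key: value for key, value in result.items() if key != "per_example"}

    paths = {
        "compact": run_dir / "result.compact.json",  # hypothetical file name
        "full": run_dir / "result.full.json",        # hypothetical file name
    }
    paths["compact"].write_text(
        json.dumps(compact, sort_keys=True, separators=(",", ":")),
        encoding="utf-8",
    )
    paths["full"].write_text(
        json.dumps(result, sort_keys=True, indent=2),
        encoding="utf-8",
    )
    return paths
```

Splitting the output this way keeps a small file that is cheap to diff and load, while the full file preserves everything needed for later analysis.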