API reference#
This API reference is auto-generated from NumPy-style docstrings in
src/eval_toolkit/. It is organized by the README’s
three-tier architecture:
Tier 1: Functional core. Pure metric / bootstrap / calibration primitives. Take numpy arrays in, return numpy arrays / floats / dataclasses out. No filesystem, no IO.
Tier 2: Protocol-based orchestration. Composable building blocks (Scorer, Splitter, LeakageCheck, ThresholdSelector, DatasetLoader, EvidenceGate). The harness (evaluate) wires them together.
Tier 3: Reproducibility scaffolding. NeurIPS-aligned manifests, versioned JSON schemas, seed management.
Use the navigation to drill into any module. The headline symbols for each tier are listed below with one-line summaries.
Tier 1: Functional core#
Metrics (metrics)#
pr_auc(y, score): area under the precision-recall curve
roc_auc(y, score): area under the ROC curve
brier_score(y, score): strictly proper scoring rule (mean squared error between probabilities and labels)
expected_calibration_error (+ debiased / L2 / equal-mass variants)
headline_metrics(y, score): bundled {pr_auc, roc_auc, brier, ece, n, n_positive} for harness output
metrics_at_threshold(y, score, t): precision / recall / F1 at a fixed decision threshold
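To make the two calibration-oriented metrics concrete, here is a minimal pure-Python sketch of a Brier score and an equal-width expected calibration error. The function names, the `n_bins` parameter, and the binning rule are assumptions chosen for illustration; they are not the signatures or defaults of `eval_toolkit`, which operates on numpy arrays.

```python
from typing import Sequence


def _check_inputs(y: Sequence[int], score: Sequence[float]) -> None:
    if len(y) != len(score) or len(y) == 0:
        raise ValueError("y and score must be non-empty and of equal length")
    if any(s < 0.0 or s > 1.0 for s in score):
        raise ValueError("scores must be probabilities in [0, 1]")


def brier_score_sketch(y: Sequence[int], score: Sequence[float]) -> float:
    """Mean squared error between predicted probabilities and 0/1 labels."""
    _check_inputs(y, score)
    return sum((s - t) ** 2 for t, s in zip(y, score)) / len(y)


def ece_equal_width_sketch(
    y: Sequence[int], score: Sequence[float], n_bins: int = 10
) -> float:
    """Weighted mean of |observed positive rate - mean predicted probability|
    over equal-width probability bins (hypothetical helper)."""
    _check_inputs(y, score)
    bins = [[] for _ in range(n_bins)]
    for t, s in zip(y, score):
        idx = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[idx].append((t, s))
    total = len(y)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        positive_rate = sum(t for t, _ in members) / len(members)
        mean_score = sum(s for _, s in members) / len(members)
        ece += (len(members) / total) * abs(positive_rate - mean_score)
    return ece
```

Lower is better for both quantities. The debiased, L2, and equal-mass variants mentioned above change the binning or the per-bin penalty and are not shown here.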
Bootstrap & inference (bootstrap)#
bootstrap_ci(y, score, metric=...): 95% BCa or percentile CI on any metric
paired_bootstrap_diff(y, s_a, s_b, metric=...): significance test on the difference of two scorers (preserves within-sample correlation)
cv_clt_ci(fold_metrics): CLT-based CI on cross-validated point estimates
mde_from_ci(result, alpha, power): minimum detectable effect
delong_roc_variance(y, s_a, s_b): DeLong’s nonparametric variance
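The percentile interval and the paired difference can be sketched in a few lines of standard-library Python. This is an illustration of the general technique only: the function names, the `n_resamples`, `alpha`, and `seed` parameters, and the simple order-statistic quantile rule are all assumptions, and the BCa correction is not implemented here.

```python
import random
from typing import Callable, List, Sequence, Tuple

Metric = Callable[[Sequence[int], Sequence[float]], float]


def _percentile_interval(stats: List[float], alpha: float) -> Tuple[float, float]:
    # Simple order-statistic quantiles on the sorted bootstrap statistics.
    stats = sorted(stats)
    last = len(stats) - 1
    return stats[int((alpha / 2) * last)], stats[int((1 - alpha / 2) * last)]


def percentile_bootstrap_ci_sketch(
    y: Sequence[int], score: Sequence[float], metric: Metric,
    n_resamples: int = 2000, alpha: float = 0.05, seed: int = 0,
) -> Tuple[float, float]:
    """Percentile CI: resample rows with replacement, recompute the metric."""
    rng = random.Random(seed)
    n = len(y)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y[i] for i in idx], [score[i] for i in idx]))
    return _percentile_interval(stats, alpha)


def paired_bootstrap_diff_sketch(
    y: Sequence[int], s_a: Sequence[float], s_b: Sequence[float], metric: Metric,
    n_resamples: int = 2000, alpha: float = 0.05, seed: int = 0,
) -> Tuple[float, float]:
    """CI on metric(A) - metric(B). Both scorers see the SAME resampled rows,
    which is what preserves the within-sample correlation."""
    rng = random.Random(seed)
    n = len(y)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        y_b = [y[i] for i in idx]
        diffs.append(
            metric(y_b, [s_a[i] for i in idx]) - metric(y_b, [s_b[i] for i in idx])
        )
    return _percentile_interval(diffs, alpha)
```

If the interval on the difference excludes zero, the two scorers differ at roughly the chosen `alpha` level. Ranking metrics such as ROC AUC are undefined on a resample that contains a single class, so a real implementation needs a policy for that case.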
Calibration (calibration)#
fit_platt_calibrator(y, score): sigmoid scaling (Platt 1999)
fit_isotonic_calibrator(y, score): monotone non-parametric fit
fit_temperature(...): single-parameter temperature scaling
bayes_optimal_threshold(prior, fp_cost, fn_cost): analytic cost-optimal threshold
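The idea behind an analytic cost-optimal threshold follows from standard decision theory: predict positive when the expected cost of a miss exceeds the expected cost of a false alarm. The sketch below shows the two textbook forms. How `bayes_optimal_threshold` itself combines `prior` with the costs is not stated on this page, so both helper names and the choice of forms are assumptions.

```python
def posterior_threshold_sketch(fp_cost: float, fn_cost: float) -> float:
    """Threshold on a calibrated posterior p = P(y=1 | x).

    Predict positive when p * fn_cost >= (1 - p) * fp_cost,
    i.e. when p >= fp_cost / (fp_cost + fn_cost).
    """
    if fp_cost <= 0 or fn_cost <= 0:
        raise ValueError("costs must be positive")
    return fp_cost / (fp_cost + fn_cost)


def likelihood_ratio_threshold_sketch(
    prior: float, fp_cost: float, fn_cost: float
) -> float:
    """Threshold on the likelihood ratio p(x | y=1) / p(x | y=0).

    The class prior enters here because a likelihood ratio, unlike a
    posterior, does not already contain it.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    if fp_cost <= 0 or fn_cost <= 0:
        raise ValueError("costs must be positive")
    return (fp_cost * (1.0 - prior)) / (fn_cost * prior)
```

With equal costs the posterior threshold is 0.5; making a false negative nine times as costly as a false positive lowers it to 0.1. Both forms assume the scores are well calibrated, which is why a threshold of this kind is usually paired with a calibrator.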
Tier 2: Protocol-based orchestration#
Harness (harness)#
evaluate(scorers, slices, run_id=...): slice-aware orchestrator; returns RunResult
evaluate_folded(splitter, scorers, slice_, ...): CV variant
EvalSlice: DataFrame wrapper with configurable column names
RunResult: JSON-serializable run container (schema-versioned)
write_run_result(result, run_dir): persist + schema-validate
Splitters (splits)#
Splitter Protocol
StratifiedKFoldSplitter, GroupKFoldSplitter, SourceDisjointKFoldSplitter, TimeSeriesSplitter, HoldoutSplitter, PurgedKFoldSplitter (with embargo)
compute_label_overlap(t_train, t_test, horizon): audit utility
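The property that group-aware and source-disjoint splitters exist to guarantee is that no group appears on both sides of a split. A minimal sketch of that idea, using a greedy fold-balancing rule, might look like the following. The function name and the balancing heuristic are assumptions for illustration and say nothing about how the splitters listed above are implemented.

```python
from collections import defaultdict
from typing import Hashable, Iterator, List, Sequence, Tuple


def group_kfold_sketch(
    groups: Sequence[Hashable], n_splits: int = 3
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs in which no group spans both sides."""
    by_group = defaultdict(list)
    for i, g in enumerate(groups):
        by_group[g].append(i)
    if len(by_group) < n_splits:
        raise ValueError("need at least as many distinct groups as splits")

    # Greedy balancing: place the largest groups first, each into the fold
    # that currently holds the fewest rows.
    folds: List[List[int]] = [[] for _ in range(n_splits)]
    ordered = sorted(by_group, key=lambda g: (-len(by_group[g]), str(g)))
    for g in ordered:
        smallest = min(range(n_splits), key=lambda k: len(folds[k]))
        folds[smallest].extend(by_group[g])

    all_idx = set(range(len(groups)))
    for test in folds:
        test_set = set(test)
        yield sorted(all_idx - test_set), sorted(test_set)
```

Every row lands in exactly one test fold, and a whole group always moves together. A purged or embargoed splitter adds a further step that drops training rows too close in time to the test rows, which is not shown here.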
Leakage detection (leakage)#
LeakageCheck Protocol
ExactDuplicateCheck, NormalizedFormLeakageCheck, NearDuplicateCheck, CrossSplitLeakageCheck, GroupLeakageCheck, LabelConflictCheck, TemporalLeakageCheck
run_leakage_checks(checks, splits) -> LeakageReport
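As a rough illustration of what exact-duplicate and normalized-form checks look for, the sketch below reports test rows whose text also occurs in the training split, either verbatim or after a simple normalization. The helper names and the normalization recipe (NFKC, case folding, whitespace collapsing) are assumptions, not the behavior of the check classes listed above.

```python
import re
import unicodedata
from typing import List, Sequence, Tuple


def normalize_text_sketch(text: str) -> str:
    """Hypothetical normal form: NFKC, case-folded, whitespace collapsed."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return re.sub(r"\s+", " ", folded).strip()


def cross_split_duplicates_sketch(
    train: Sequence[str], test: Sequence[str], normalize: bool = False
) -> List[Tuple[int, int]]:
    """Return (train_index, test_index) pairs whose texts collide.

    With normalize=False only byte-identical strings match; with
    normalize=True strings match after normalize_text_sketch.
    """
    key = normalize_text_sketch if normalize else (lambda t: t)
    first_seen = {}
    for i, text in enumerate(train):
        first_seen.setdefault(key(text), i)  # keep the first training occurrence
    return [
        (first_seen[key(text)], j)
        for j, text in enumerate(test)
        if key(text) in first_seen
    ]
```

For example, `"Hello  World"` in train and `"hello world"` in test are missed by the exact comparison but caught by the normalized one. Near-duplicate detection needs a similarity measure rather than a hash lookup and is outside this sketch.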
Threshold selection (thresholds)#
ThresholdSelector Protocol
MaxF1Selector, TargetRecallSelector, TargetPrecisionSelector, TargetFPRSelector, YoudenJSelector, CostSensitiveSelector, CISafeThresholdSelector
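Protocol-based building blocks rely on structural typing: any object with the right method satisfies the protocol, with no inheritance required. The sketch below shows the pattern with a max-F1 selector. The protocol name, the `select` method, and its signature are hypothetical and are not taken from `eval_toolkit`.

```python
from typing import Protocol, Sequence


class ThresholdSelectorSketch(Protocol):
    """Hypothetical protocol: anything with a matching select() qualifies."""

    def select(self, y: Sequence[int], score: Sequence[float]) -> float: ...


class MaxF1SelectorSketch:
    """Pick the candidate threshold that maximizes F1 (ties keep the lowest)."""

    def select(self, y: Sequence[int], score: Sequence[float]) -> float:
        if len(y) != len(score) or len(y) == 0:
            raise ValueError("y and score must be non-empty and of equal length")
        best_threshold, best_f1 = 0.0, -1.0
        for t in sorted(set(score)):  # each observed score is a candidate
            tp = sum(1 for yi, si in zip(y, score) if si >= t and yi == 1)
            fp = sum(1 for yi, si in zip(y, score) if si >= t and yi == 0)
            fn = sum(1 for yi, si in zip(y, score) if si < t and yi == 1)
            denom = 2 * tp + fp + fn
            f1 = (2 * tp / denom) if denom else 0.0
            if f1 > best_f1:
                best_threshold, best_f1 = t, f1
        return best_threshold


def choose_threshold(
    selector: ThresholdSelectorSketch, y: Sequence[int], score: Sequence[float]
) -> float:
    # The caller depends only on the protocol, not on a concrete class.
    return selector.select(y, score)
```

Because the caller is typed against the protocol, a selector with a different objective (target recall, Youden's J, cost-sensitive) can be swapped in without touching the calling code.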
Loaders (loaders)#
DatasetLoader Protocol
DataFrameLoader, ParquetGlobLoader, SingleSliceLoader, HFDatasetsLoader
Claims + evidence (claims, evidence)#
ClaimSpec + evaluate_claims(result, [claim])
Pre-built gates: headline_present_gate, metric_threshold_gate, minimum_slice_size_gate, paired_diff_present_gate, no_leakage_errors_gate, no_scorer_errors_gate, required_scorer_gate, low_fpr_feasibility_gate, strict_artifact_gate, …
EvidenceAxis, AggregateEvidence for typed aggregation
Tier 3: Reproducibility scaffolding#
Manifest (manifest)#
RunManifest, build_manifest, write_manifest
NeurIPS Reproducibility Checklist-aligned (git_sha, seeds, code_versions, env, data_hashes, contamination_flags, etc.)
v3 schema with validate_manifest(payload)
Artifacts (artifacts)#
validate_manifest, validate_results, validate_prediction_artifact_ref
write_json_strict(path, payload): atomic write with NaN/Inf rejection
sanitize_for_json(obj): recursive cleanup of numpy types
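An atomic JSON write with NaN/Inf rejection can be built from standard-library pieces alone. The sketch below shows one common recipe: serialize first with `allow_nan=False`, write to a temporary file in the target directory, then rename it over the destination. The function name and the formatting options are assumptions; this is not the source of `write_json_strict`.

```python
import json
import os
import tempfile
from typing import Any


def write_json_strict_sketch(path: str, payload: Any) -> None:
    """Write payload as strict JSON, replacing path only on success."""
    # allow_nan=False makes json.dumps raise ValueError on NaN / Infinity,
    # so a bad payload fails before anything touches the filesystem.
    text = json.dumps(payload, allow_nan=False, indent=2, sort_keys=True)

    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the final rename stays
    # on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())  # push bytes to disk before the rename
        os.replace(tmp_path, path)  # readers see the old or the new file
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Rejecting NaN and Infinity matters because they are not valid JSON, even though Python's encoder emits them by default. numpy scalars and arrays would still need converting to plain Python types before this call, which is the job a sanitizing helper performs.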
Seeds + provenance (seeds, provenance)#
set_global_seeds(seed, strict_torch_determinism=...)
capture_git_sha(), compute_file_hash(path), make_run_dir(...)
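Seeding and content hashing are the two provenance primitives that are easiest to illustrate in isolation. The sketch below seeds Python's `random` module (and numpy, if it happens to be installed) and computes a streaming file digest. The function names, the choice of SHA-256, and the chunk size are assumptions; the page does not say which hash or which libraries the toolkit's own helpers cover, and torch determinism is not handled here.

```python
import hashlib
import random


def set_global_seeds_sketch(seed: int) -> None:
    """Seed the stdlib RNG, plus numpy's legacy global RNG when available."""
    random.seed(seed)
    try:
        import numpy as np  # optional dependency in this sketch
    except ImportError:
        return
    np.random.seed(seed)


def compute_file_hash_sketch(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Reading in chunks keeps memory use flat for large dataset files. Recording such a digest next to the seed lets a later run confirm it is evaluating the same bytes.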
Full module list#
Click any module name for the auto-generated full API reference.