API reference

This API reference is auto-generated from NumPy-style docstrings in src/eval_toolkit/. It is organized by the README’s three-tier architecture:

  • Tier 1 — Functional core: pure metric / bootstrap / calibration primitives. They take NumPy arrays in and return NumPy arrays, floats, or dataclasses. No filesystem access, no I/O.

  • Tier 2 — Protocol-based orchestration: composable building blocks (Scorer, Splitter, LeakageCheck, ThresholdSelector, DatasetLoader, EvidenceGate). The harness (evaluate) wires them together.

  • Tier 3 — Reproducibility scaffolding: NeurIPS-aligned manifests, versioned JSON schemas, seed management.

Use the navigation to drill into any module. The headline symbols for each tier are listed below with one-line summaries.

Tier 1: Functional core

Metrics (metrics)

  • pr_auc(y, score) — area under the precision-recall curve

  • roc_auc(y, score) — area under the ROC curve

  • brier_score(y, score) — strictly proper scoring rule (mean squared error between predicted probabilities and labels)

  • expected_calibration_error (+ debiased / L2 / equal-mass variants)

  • headline_metrics(y, score) — bundled {pr_auc, roc_auc, brier, ece, n, n_positive} for harness output

  • metrics_at_threshold(y, score, t) — precision / recall / F1 at a fixed decision threshold
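
To make the Brier and calibration-error entries above concrete, here is a minimal NumPy sketch of the two quantities. It is not the library's implementation: the function names, the equal-width binning, and the default of 10 bins are assumptions for illustration, and the debiased / L2 / equal-mass variants are not shown.

```python
import numpy as np


def brier_score_sketch(y, score):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    y = np.asarray(y, dtype=float)
    score = np.asarray(score, dtype=float)
    return float(np.mean((score - y) ** 2))


def ece_sketch(y, score, n_bins=10):
    """Plain expected calibration error with equal-width bins on [0, 1]."""
    y = np.asarray(y, dtype=float)
    score = np.asarray(score, dtype=float)
    # Map each score to a bin index; a score of exactly 1.0 goes in the last bin.
    bin_ids = np.minimum((score * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        # |mean confidence - observed positive rate|, weighted by the bin's share of samples.
        gap = abs(score[mask].mean() - y[mask].mean())
        ece += mask.mean() * gap
    return float(ece)


y = np.array([0, 0, 1, 1])
score = np.array([0.1, 0.4, 0.35, 0.8])
print(brier_score_sketch(y, score))
print(ece_sketch(y, score, n_bins=2))
```

Both functions take arrays in and return a float, which matches the Tier 1 contract described at the top of this page.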

Bootstrap & inference (bootstrap)

  • bootstrap_ci(y, score, metric=...) — 95% BCa or percentile CI on any metric

  • paired_bootstrap_diff(y, s_a, s_b, metric=...) — significance test on the difference of two scorers (preserves within-sample correlation)

  • cv_clt_ci(fold_metrics) — CLT-based CI on cross-validated point estimates

  • mde_from_ci(result, alpha, power) — minimum detectable effect

  • delong_roc_variance(y, s_a, s_b) — DeLong’s nonparametric variance estimate for comparing the ROC AUCs of two scorers
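
The idea behind a paired bootstrap on the difference of two scorers can be sketched as follows. This is an illustration under stated assumptions, not the library's code: it uses a percentile interval only (no BCa correction, no p-value), and `accuracy_at_half`, `n_boot`, `alpha`, and `seed` are hypothetical names chosen for the demo.

```python
import numpy as np


def accuracy_at_half(y, score):
    """Toy metric for the demo: accuracy at a 0.5 decision threshold."""
    return float(np.mean((score >= 0.5) == (y == 1)))


def paired_bootstrap_diff_sketch(y, s_a, s_b, metric, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI on metric(y, s_a) - metric(y, s_b) using shared resamples."""
    y, s_a, s_b = (np.asarray(a) for a in (y, s_a, s_b))
    rng = np.random.default_rng(seed)
    n = len(y)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # One index draw is applied to both scorers, so every resample keeps
        # the per-example pairing between s_a and s_b.
        idx = rng.integers(0, n, size=n)
        diffs[i] = metric(y[idx], s_a[idx]) - metric(y[idx], s_b[idx])
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    point = metric(y, s_a) - metric(y, s_b)
    return point, float(lo), float(hi)


# Synthetic demo data: s_a tracks the label, s_b is pure noise.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=200)
s_a = 0.6 * y + rng.normal(0.2, 0.2, size=200)
s_b = rng.random(200)
point, lo, hi = paired_bootstrap_diff_sketch(y, s_a, s_b, accuracy_at_half, n_boot=500)
print(f"diff={point:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

Resampling the two scorers independently would discard the pairing and typically widen the interval, which is why a single index draw is shared.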

Calibration (calibration)

  • fit_platt_calibrator(y, score) — sigmoid scaling (Platt 1999)

  • fit_isotonic_calibrator(y, score) — monotone non-parametric fit

  • fit_temperature(...) — single-parameter temperature scaling

  • bayes_optimal_threshold(prior, fp_cost, fn_cost) — analytic cost-optimal threshold
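
For the analytic cost-optimal threshold, the standard Bayes decision rule gives a closed form. The sketch below shows that textbook rule in two equivalent forms; how the library's bayes_optimal_threshold combines `prior` with the costs is not documented on this page, so both function names and the choice of what to threshold (posterior vs. likelihood ratio) are assumptions.

```python
def cost_optimal_posterior_threshold(fp_cost, fn_cost):
    """Threshold on a calibrated posterior p = P(y=1 | x).

    Predicting positive costs (1 - p) * fp_cost in expectation and predicting
    negative costs p * fn_cost, so positive is cheaper once
    p >= fp_cost / (fp_cost + fn_cost).
    """
    if fp_cost <= 0 or fn_cost <= 0:
        raise ValueError("costs must be positive")
    return fp_cost / (fp_cost + fn_cost)


def cost_optimal_likelihood_ratio(prior, fp_cost, fn_cost):
    """Equivalent threshold on the likelihood ratio p(x | y=1) / p(x | y=0)."""
    if not 0 < prior < 1:
        raise ValueError("prior must lie strictly between 0 and 1")
    return (fp_cost * (1 - prior)) / (fn_cost * prior)


# A miss that costs 9x a false alarm pushes the posterior threshold down to 0.1.
print(cost_optimal_posterior_threshold(fp_cost=1.0, fn_cost=9.0))
print(cost_optimal_likelihood_ratio(prior=0.5, fp_cost=1.0, fn_cost=1.0))
```

The posterior form only holds when the scores are calibrated probabilities, which is one reason the calibrators and the threshold helper sit in the same module.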

Tier 2: Protocol-based orchestration

Harness (harness)

  • evaluate(scorers, slices, run_id=...) — slice-aware orchestrator; returns RunResult

  • evaluate_folded(splitter, scorers, slice_, ...) — CV variant

  • EvalSlice — DataFrame wrapper with configurable column names

  • RunResult — JSON-serializable run container (schema-versioned)

  • write_run_result(result, run_dir) — persist + schema-validate

Splitters (splits)

  • Splitter Protocol

  • StratifiedKFoldSplitter, GroupKFoldSplitter, SourceDisjointKFoldSplitter, TimeSeriesSplitter, HoldoutSplitter, PurgedKFoldSplitter (with embargo)

  • compute_label_overlap(t_train, t_test, horizon) — audit utility
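
As a rough picture of what purging with an embargo means for time-ordered data, the sketch below drops training indices that sit too close to the test block. It is a simplified stand-in, not PurgedKFoldSplitter itself: the function name, the symmetric `horizon` rule, and the integer `embargo` are all assumptions, and the library may define label overlap differently (see compute_label_overlap).

```python
import numpy as np


def purged_kfold_indices(n_samples, n_splits=5, horizon=0, embargo=0):
    """Yield (train_idx, test_idx) over contiguous, time-ordered folds.

    Training samples within `horizon` steps of the test block are purged on
    both sides, and a further `embargo` steps are dropped after the block.
    """
    indices = np.arange(n_samples)
    for test_idx in np.array_split(indices, n_splits):
        start, stop = test_idx[0], test_idx[-1]
        keep = (indices < start - horizon) | (indices > stop + horizon + embargo)
        yield indices[keep], test_idx


for train_idx, test_idx in purged_kfold_indices(20, n_splits=4, horizon=1, embargo=2):
    print(test_idx.tolist(), "<-", train_idx.tolist())
```

With `horizon=0` and `embargo=0` this reduces to ordinary contiguous K-fold, which makes the effect of each parameter easy to inspect.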

Leakage detection (leakage)

  • LeakageCheck Protocol

  • ExactDuplicateCheck, NormalizedFormLeakageCheck, NearDuplicateCheck, CrossSplitLeakageCheck, GroupLeakageCheck, LabelConflictCheck, TemporalLeakageCheck

  • run_leakage_checks(checks, splits) -> LeakageReport
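
The Protocol-based design can be illustrated with a small structural-typing sketch: any object with the right method satisfies the interface, with no base class required. Everything here is hypothetical, including the `LeakageCheckLike` interface, the `run(train, test)` signature, and the dict returned by the runner; the real LeakageCheck Protocol and LeakageReport may look quite different.

```python
from typing import Protocol, Sequence, runtime_checkable


@runtime_checkable
class LeakageCheckLike(Protocol):
    """Hypothetical interface: return the test items that leak from train."""

    def run(self, train: Sequence[str], test: Sequence[str]) -> list[str]: ...


class ExactDuplicateSketch:
    """Flags test items that appear verbatim in the training split."""

    def run(self, train, test):
        seen = set(train)
        return sorted({t for t in test if t in seen})


class NormalizedDuplicateSketch:
    """Flags test items that match after lowercasing and whitespace collapsing."""

    @staticmethod
    def _norm(text):
        return " ".join(text.lower().split())

    def run(self, train, test):
        seen = {self._norm(s) for s in train}
        return sorted({t for t in test if self._norm(t) in seen})


def run_checks_sketch(checks, train, test):
    """Run every check and collect findings keyed by the check's class name."""
    return {type(c).__name__: c.run(train, test) for c in checks}


train = ["Hello world", "foo"]
test = ["hello  WORLD", "foo", "bar"]
report = run_checks_sketch([ExactDuplicateSketch(), NormalizedDuplicateSketch()], train, test)
print(report)
```

Neither check class inherits from `LeakageCheckLike`, yet both pass an `isinstance` test against it, which is what lets users plug in their own checks.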

Threshold selection (thresholds)

  • ThresholdSelector Protocol

  • MaxF1Selector, TargetRecallSelector, TargetPrecisionSelector, TargetFPRSelector, YoudenJSelector, CostSensitiveSelector, CISafeThresholdSelector
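
Two of the selection rules named above, max-F1 and Youden's J, can be sketched with a brute-force scan over candidate thresholds. The function names and the `score >= t` convention are assumptions, and the library's selectors may use a faster sweep or different tie-breaking; this version keeps the lowest threshold on ties.

```python
import numpy as np


def confusion_at(y, score, t):
    """Confusion counts when predicting positive for score >= t."""
    pred = score >= t
    tp = int(np.sum(pred & (y == 1)))
    fp = int(np.sum(pred & (y == 0)))
    fn = int(np.sum(~pred & (y == 1)))
    tn = int(np.sum(~pred & (y == 0)))
    return tp, fp, fn, tn


def max_f1_threshold(y, score):
    """Scan every observed score and keep the threshold with the highest F1."""
    y, score = np.asarray(y), np.asarray(score)
    best_t, best_f1 = None, -1.0
    for t in np.unique(score):
        tp, fp, fn, _ = confusion_at(y, score, t)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = float(t), f1
    return best_t, best_f1


def youden_j_threshold(y, score):
    """Maximize J = TPR - FPR; assumes both classes are present in y."""
    y, score = np.asarray(y), np.asarray(score)
    best_t, best_j = None, -2.0
    for t in np.unique(score):
        tp, fp, fn, tn = confusion_at(y, score, t)
        j = tp / (tp + fn) - fp / (fp + tn)
        if j > best_j:
            best_t, best_j = float(t), j
    return best_t, best_j


y = np.array([0, 0, 1, 1])
score = np.array([0.1, 0.4, 0.35, 0.8])
print(max_f1_threshold(y, score))
print(youden_j_threshold(y, score))
```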

Loaders (loaders)

  • DatasetLoader Protocol

  • DataFrameLoader, ParquetGlobLoader, SingleSliceLoader, HFDatasetsLoader

Claims + evidence (claims, evidence)

  • ClaimSpec + evaluate_claims(result, [claim])

  • Pre-built gates: headline_present_gate, metric_threshold_gate, minimum_slice_size_gate, paired_diff_present_gate, no_leakage_errors_gate, no_scorer_errors_gate, required_scorer_gate, low_fpr_feasibility_gate, strict_artifact_gate, …

  • EvidenceAxis, AggregateEvidence for typed aggregation

Tier 3: Reproducibility scaffolding

Manifest (manifest)

  • RunManifest, build_manifest, write_manifest

  • NeurIPS Reproducibility Checklist-aligned (git_sha, seeds, code_versions, env, data_hashes, contamination_flags, etc.)

  • v3 schema with validate_manifest(payload)

Artifacts (artifacts)

  • validate_manifest, validate_results, validate_prediction_artifact_ref

  • write_json_strict(path, payload) — atomic write with NaN/Inf rejection

  • sanitize_for_json(obj) — recursive cleanup of numpy types
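
An atomic write with NaN/Inf rejection can be built from the standard library alone. The sketch below is one plausible approach, not the library's write_json_strict: the function name, the temp-file-plus-rename strategy, and the formatting options are assumptions.

```python
import json
import os
import tempfile


def write_json_strict_sketch(path, payload):
    """Serialize first, then write to a temp file and rename it into place."""
    # allow_nan=False makes json.dumps raise ValueError on NaN or +/-Infinity.
    # Serializing before opening any file means a bad payload leaves no trace.
    text = json.dumps(payload, allow_nan=False, indent=2, sort_keys=True)
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write(text)
            fh.flush()
            os.fsync(fh.fileno())
        # os.replace swaps the file in a single step on the same filesystem,
        # so readers see either the old content or the new, never a partial file.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


out_dir = tempfile.mkdtemp()
good_path = os.path.join(out_dir, "results.json")
write_json_strict_sketch(good_path, {"pr_auc": 0.91, "n": 200})
with open(good_path, encoding="utf-8") as fh:
    print(json.load(fh))
```

Rejecting NaN and Infinity matters because Python's default encoder emits them as bare tokens that are not valid JSON and that many other parsers refuse.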

Seeds + provenance (seeds, provenance)

  • set_global_seeds(seed, strict_torch_determinism=...)

  • capture_git_sha(), compute_file_hash(path), make_run_dir(...)
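
Seeding and file hashing are simple enough to sketch with the standard library. The versions below are illustrative only: the function names are hypothetical, SHA-256 is an assumed choice of hash, and the seeding helper covers only `random` and (if installed) NumPy's legacy global generator, without the torch determinism handling that the strict_torch_determinism parameter implies.

```python
import hashlib
import os
import random
import tempfile


def set_seeds_sketch(seed):
    """Seed Python's RNG and, when NumPy is installed, its global RNG too."""
    random.seed(seed)
    try:
        import numpy as np
    except ImportError:
        return
    np.random.seed(seed)


def compute_file_hash_sketch(path, chunk_size=1 << 20):
    """Stream the file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


set_seeds_sketch(7)
first = random.random()
set_seeds_sketch(7)
second = random.random()
print(first == second)

demo_path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(demo_path, "wb") as fh:
    fh.write(b"abc")
print(compute_file_hash_sketch(demo_path))
```

Recording such a hash next to the seed in a run manifest lets a later reader confirm that a rerun used the same bytes, not just the same file name.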


Full module list

Click any module name for the auto-generated full API reference.