# API reference

This API reference is auto-generated from NumPy-style docstrings in `src/eval_toolkit/`. It is organized by the README's [three-tier architecture](../index.md#three-tier-architecture):

- **Tier 1 — Functional core**: pure metric / bootstrap / calibration primitives. Take numpy arrays in, return numpy arrays / floats / dataclasses out. No filesystem, no IO.
- **Tier 2 — Protocol-based orchestration**: composable building blocks (Scorer, Splitter, LeakageCheck, ThresholdSelector, DatasetLoader, EvidenceGate). The harness (`evaluate`) wires them together.
- **Tier 3 — Reproducibility scaffolding**: NeurIPS-aligned manifests, versioned JSON schemas, seed management.

Use the navigation to drill into any module. Below, the headline symbols per tier with one-line summaries:

## Tier 1: Functional core

### Metrics ([`metrics`](metrics.md))

- `pr_auc(y, score)` — area under the precision-recall curve
- `roc_auc(y, score)` — area under the ROC curve
- `brier_score(y, score)` — strictly-proper scoring rule (mean squared error between probabilities and labels)
- `expected_calibration_error` (+ debiased / L2 / equal-mass variants)
- `headline_metrics(y, score)` — bundled `{pr_auc, roc_auc, brier, ece, n, n_positive}` for harness output
- `metrics_at_threshold(y, score, t)` — precision / recall / F1 at a fixed decision threshold

### Bootstrap & inference ([`bootstrap`](bootstrap.md))

- `bootstrap_ci(y, score, metric=...)` — 95% BCa or percentile CI on any metric
- `paired_bootstrap_diff(y, s_a, s_b, metric=...)` — significance test on the *difference* of two scorers (preserves within-sample correlation)
- `cv_clt_ci(fold_metrics)` — CLT-based CI on cross-validated point estimates
- `mde_from_ci(result, alpha, power)` — minimum detectable effect
- `delong_roc_variance(y, s_a, s_b)` — DeLong's nonparametric variance

### Calibration ([`calibration`](calibration.md))

- `fit_platt_calibrator(y, score)` — sigmoid scaling (Platt 1999)
- `fit_isotonic_calibrator(y, score)` — monotone non-parametric fit
- `fit_temperature(...)` — single-parameter temperature scaling
- `bayes_optimal_threshold(prior, fp_cost, fn_cost)` — analytic cost-optimal threshold

## Tier 2: Protocol-based orchestration

### Harness ([`harness`](harness.md))

- `evaluate(scorers, slices, run_id=...)` — slice-aware orchestrator; returns `RunResult`
- `evaluate_folded(splitter, scorers, slice_, ...)` — CV variant
- `EvalSlice` — DataFrame wrapper with configurable column names
- `RunResult` — JSON-serializable run container (schema-versioned)
- `write_run_result(result, run_dir)` — persist + schema-validate

### Splitters ([`splits`](splits.md))

- `Splitter` Protocol
- `StratifiedKFoldSplitter`, `GroupKFoldSplitter`, `SourceDisjointKFoldSplitter`, `TimeSeriesSplitter`, `HoldoutSplitter`, `PurgedKFoldSplitter` (with embargo)
- `compute_label_overlap(t_train, t_test, horizon)` — audit utility

### Leakage detection ([`leakage`](leakage.md))

- `LeakageCheck` Protocol
- `ExactDuplicateCheck`, `NormalizedFormLeakageCheck`, `NearDuplicateCheck`, `CrossSplitLeakageCheck`, `GroupLeakageCheck`, `LabelConflictCheck`, `TemporalLeakageCheck`
- `run_leakage_checks(checks, splits) -> LeakageReport`

### Threshold selection ([`thresholds`](thresholds.md))

- `ThresholdSelector` Protocol
- `MaxF1Selector`, `TargetRecallSelector`, `TargetPrecisionSelector`, `TargetFPRSelector`, `YoudenJSelector`, `CostSensitiveSelector`, `CISafeThresholdSelector`

### Loaders ([`loaders`](loaders.md))

- `DatasetLoader` Protocol
- `DataFrameLoader`, `ParquetGlobLoader`, `SingleSliceLoader`, `HFDatasetsLoader`

### Claims + evidence ([`claims`](claims.md), [`evidence`](evidence.md))

- `ClaimSpec` + `evaluate_claims(result, [claim])`
- Pre-built gates: `headline_present_gate`, `metric_threshold_gate`, `minimum_slice_size_gate`, `paired_diff_present_gate`, `no_leakage_errors_gate`, `no_scorer_errors_gate`, `required_scorer_gate`, `low_fpr_feasibility_gate`, `strict_artifact_gate`, ...
- `EvidenceAxis`, `AggregateEvidence` for typed aggregation

## Tier 3: Reproducibility scaffolding

### Manifest ([`manifest`](manifest.md))

- `RunManifest`, `build_manifest`, `write_manifest`
- NeurIPS Reproducibility Checklist-aligned (git_sha, seeds, code_versions, env, data_hashes, contamination_flags, etc.)
- v3 schema with `validate_manifest(payload)`

### Artifacts ([`artifacts`](artifacts.md))

- `validate_manifest`, `validate_results`, `validate_prediction_artifact_ref`
- `write_json_strict(path, payload)` — atomic write with NaN/Inf rejection
- `sanitize_for_json(obj)` — recursive cleanup of numpy types

### Seeds + provenance ([`seeds`](seeds.md), [`provenance`](provenance.md))

- `set_global_seeds(seed, strict_torch_determinism=...)`
- `capture_git_sha()`, `compute_file_hash(path)`, `make_run_dir(...)`

---

## Full module list

Click any module name for the auto-generated full API reference.
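
### Illustration: paired bootstrap on a metric difference

The Tier 1 entry for `paired_bootstrap_diff` says it "preserves within-sample correlation". The sketch below shows one way that idea can be realised: each resample draws a single set of row indices and applies it to the labels and to both score vectors, so the two scorers are always compared on the same rows. This is a standard-library illustration only, not the toolkit's implementation. The names `brier_score_sketch` and `paired_bootstrap_diff_sketch` are hypothetical, the interval is a plain percentile interval rather than BCa, and the real functions operate on numpy arrays and may return a richer result object.

```python
import random
from typing import Callable, Sequence, Tuple

Metric = Callable[[Sequence[int], Sequence[float]], float]


def brier_score_sketch(y: Sequence[int], score: Sequence[float]) -> float:
    """Mean squared error between predicted probabilities and 0/1 labels."""
    return sum((p - t) ** 2 for t, p in zip(y, score)) / len(y)


def paired_bootstrap_diff_sketch(
    y: Sequence[int],
    s_a: Sequence[float],
    s_b: Sequence[float],
    metric: Metric,
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (point estimate, lower, upper) for metric(A) - metric(B).

    Hypothetical helper: percentile interval, pure Python, fixed seed.
    """
    rng = random.Random(seed)
    n = len(y)
    diffs = []
    for _ in range(n_resamples):
        # One index draw shared by labels and both scorers keeps the pairing.
        idx = [rng.randrange(n) for _ in range(n)]
        y_b = [y[i] for i in idx]
        a_b = [s_a[i] for i in idx]
        b_b = [s_b[i] for i in idx]
        diffs.append(metric(y_b, a_b) - metric(y_b, b_b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * (n_resamples - 1))]
    upper = diffs[int((1 - alpha / 2) * (n_resamples - 1))]
    point = metric(y, s_a) - metric(y, s_b)
    return point, lower, upper


if __name__ == "__main__":
    labels = [0, 1, 1, 0, 1, 0, 1, 1]
    scorer_a = [0.1, 0.9, 0.8, 0.2, 0.7, 0.3, 0.6, 0.9]
    scorer_b = [0.4, 0.6, 0.5, 0.5, 0.6, 0.4, 0.5, 0.6]
    print(paired_bootstrap_diff_sketch(labels, scorer_a, scorer_b, brier_score_sketch))
```

If the interval on the difference excludes zero, the two scorers differ on that metric at roughly the chosen `alpha` level. Resampling the two score vectors independently would discard the pairing and typically widen the interval.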