eval-toolkit
Reusable evaluation contracts for binary classification — metrics, bootstrap confidence intervals, calibration, leakage detection, threshold selection, and a pluggable harness that ties them together.
Get started
Examples
- Worked example: metrics + bootstrap CIs
- Worked example: slice-aware `evaluate` harness
- Worked example: calibration with Platt + isotonic
- Worked example: leakage detection
- Worked example: claims + evidence gates
- Worked example: paired bootstrap comparison
- Worked example: prompt-injection classifier evaluation
- Worked example: PyTorch + LoRA `Scorer` adapter
- Worked example: declarative OOD slate loading
- Worked example: character-injection adversarial sweep
- Worked example: ActivationDeltaProbe (TaskTracker port)
- Worked example: Spotlighting structural defenses
- Worked example: RecallAtLowFPR loss training
Methodology
- Splits
- Comparison & confidence intervals
- Reproducibility
- Claims and Gates
- Prediction Artifacts and Metric States
- Evidence And Claims
- Bootstrap
- Calibration
- Leakage
- Text deduplication
- Fairness & subgroup slicing
- Stratified PR-AUC & the gap-flag report
- Parallelism
- Reading list
- Testing your evaluation code
- Threshold selection
- Versioning Tier-2 implementations
API reference
- eval_toolkit.analysis
- eval_toolkit.artifacts
- eval_toolkit.bootstrap
- eval_toolkit.calibration
- eval_toolkit.claims
- eval_toolkit.config
- eval_toolkit.docs
- eval_toolkit.embeddings
- eval_toolkit.evidence
- eval_toolkit.harness
- eval_toolkit.leakage
- eval_toolkit.loaders
- eval_toolkit.manifest
- eval_toolkit.metrics
- eval_toolkit.operating_points
- eval_toolkit.paths
- eval_toolkit.plotting
- eval_toolkit.protocols
- eval_toolkit.provenance
- eval_toolkit.seeds
- eval_toolkit.splits
- eval_toolkit.text_dedup
- eval_toolkit.thresholds