eval-toolkit
Reusable evaluation contracts for binary classification — metrics, bootstrap confidence intervals, calibration, leakage detection, threshold selection, and a pluggable harness that ties them together.
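As a rough illustration of what "metrics with bootstrap confidence intervals" means, here is a minimal standard-library sketch of a percentile bootstrap around accuracy. It does not use eval-toolkit's API; the names accuracy and bootstrap_ci, their parameters, and the toy labels are hypothetical and chosen only for this example.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def bootstrap_ci(y_true, y_pred, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI (hypothetical helper, not the library's function).

    Resamples (label, prediction) pairs with replacement, recomputes the
    metric on each resample, and returns the alpha/2 and 1 - alpha/2
    empirical quantiles of the resampled metric values.
    """
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


# Toy data, invented for illustration only.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

point = accuracy(y_true, y_pred)
low, high = bootstrap_ci(y_true, y_pred, accuracy)
print(f"accuracy={point:.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

With only ten examples the interval is wide, which is the point of reporting it alongside the point estimate. The worked examples listed below cover the toolkit's own interfaces.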
Get started
Examples
- Worked example: metrics + bootstrap CIs
- Worked example: slice-aware evaluate harness
- Worked example: calibration with Platt + isotonic
- Worked example: leakage detection
- Worked example: claims + evidence gates
- Worked example: paired bootstrap comparison
- Worked example: prompt-injection classifier evaluation
- Worked example: PyTorch + LoRA Scorer adapter
- Worked example: declarative OOD slate loading
- Worked example: character-injection adversarial sweep
- Worked example: ActivationDeltaProbe (TaskTracker port)
- Worked example: Spotlighting structural defenses
- Worked example: RecallAtLowFPR loss training
Methodology
- Splits
- Comparison & confidence intervals
- Reproducibility
- Claims and Gates
- Prediction Artifacts and Metric States
- Evidence And Claims
- Bootstrap
- Calibration
- Leakage
- Text deduplication
- Fairness & subgroup slicing
- Stratified PR-AUC & the gap-flag report
- Parallelism
- Reading list
- Testing your evaluation code
- Threshold selection
- Versioning Tier-2 implementations
API reference
- eval_toolkit.adversarial
- eval_toolkit.analysis
- eval_toolkit.artifacts
- eval_toolkit.audit_citation_alignment
- eval_toolkit.audit_sister_doc_concept_drift
- eval_toolkit.audit_value_bindings
- eval_toolkit.bootstrap
- eval_toolkit.calibration
- eval_toolkit.claims
- eval_toolkit.config
- eval_toolkit.docs
- eval_toolkit.eda
- eval_toolkit.embeddings
- eval_toolkit.evidence
- eval_toolkit.harness
- eval_toolkit.leakage
- eval_toolkit.loaders
- eval_toolkit.losses
- eval_toolkit.manifest
- eval_toolkit.metric_specs
- eval_toolkit.metrics
- eval_toolkit.operating_points
- eval_toolkit.paths
- eval_toolkit.plotting
- eval_toolkit.preprocessing
- eval_toolkit.probes
- eval_toolkit.protocols
- Strict Tier-2 Protocols at v1.0
- eval_toolkit.provenance
- scorecard family: primary metric surface (v0.46+)
- eval_toolkit.seeds
- eval_toolkit.splits
- eval_toolkit.stacking
- sweep: unified text-transform enumeration (v0.47)
- eval_toolkit.text_dedup
- eval_toolkit.thresholds