eval-toolkit
Getting Started
What’s new
Examples
Worked example: metrics + bootstrap CIs
Worked example: slice-aware evaluate harness
Worked example: calibration with Platt + isotonic
Worked example: leakage detection
Worked example: claims + evidence gates
Worked example: paired bootstrap comparison
Worked example: prompt-injection classifier evaluation
Worked example: PyTorch + LoRA Scorer adapter
Worked example: declarative OOD slate loading
Worked example: character-injection adversarial sweep
Worked example: ActivationDeltaProbe (TaskTracker port)
Worked example: Spotlighting structural defenses
Worked example: RecallAtLowFPR loss training
Methodology
Splits
Comparison & confidence intervals
Reproducibility
Claims and Gates
Prediction Artifacts and Metric States
Evidence And Claims
Bootstrap
Calibration
Leakage
Text deduplication
Fairness & subgroup slicing
Stratified PR-AUC & the gap-flag report
Parallelism
Reading list
Testing your evaluation code
Threshold selection
Versioning Tier-2 implementations
API reference
eval_toolkit.analysis
eval_toolkit.artifacts
eval_toolkit.bootstrap
eval_toolkit.calibration
eval_toolkit.claims
eval_toolkit.config
eval_toolkit.docs
eval_toolkit.embeddings
eval_toolkit.evidence
eval_toolkit.harness
eval_toolkit.leakage
eval_toolkit.loaders
eval_toolkit.manifest
eval_toolkit.metrics
eval_toolkit.operating_points
eval_toolkit.paths
eval_toolkit.plotting
eval_toolkit.protocols
eval_toolkit.provenance
eval_toolkit.seeds
eval_toolkit.splits
eval_toolkit.text_dedup
eval_toolkit.thresholds
v0.6.x → v0.7.x migration
v0.7.x → v0.8.0 migration
v0.8.x → v0.9.0 migration
Extending eval-toolkit
Schema Reference
Roadmap
Repo Strategy
Deprecation policy
Migration guides
Releasing eval-toolkit
GitHub
PyPI
Search