
eval-toolkit



eval_toolkit.evidence.RECOMMENDED_SOURCE_ROLES

.. currentmodule:: eval_toolkit.evidence

.. autodata:: RECOMMENDED_SOURCE_ROLES



© Copyright 2026, Brandon Behring.

Created using Sphinx 9.1.0.

Built with the PyData Sphinx Theme 0.17.1.