eval_toolkit.probes#

Activation probes — linear classifiers over transformer hidden states. ActivationDeltaProbe is a port of the TaskTracker methodology (Abdelnabi et al. 2024, arXiv:2406.00799) for prompt-injection detection via linear probing of activation deltas at a chosen transformer layer.

Optional dependency: pip install eval-toolkit[probes] (installs torch

  • transformers).

ActivationDeltaProbe

TaskTracker-style activation-delta linear probe.

ActivationExtractor

Extract a hidden-state vector for each input text.

Probe

Minimal probe surface: fit + predict on raw text inputs.