eval_toolkit.probes#
Activation probes — linear classifiers over transformer hidden states.
ActivationDeltaProbe is a port of the TaskTracker methodology
(Abdelnabi et al. 2024, arXiv:2406.00799) for prompt-injection detection
via linear probing of activation deltas at a chosen transformer layer.
Optional dependency: pip install eval-toolkit[probes] (installs torch
transformers).
|
TaskTracker-style activation-delta linear probe. |
|
Extract a hidden-state vector for each input text. |
|
Minimal probe surface: fit + predict on raw text inputs. |