Worked example: detector stacking with `LogisticStacker`#

What this shows. Combine three synthetic per-sample detectors into a calibrated ensemble using LogisticStacker, the v0.45.0 reference impl of the MetaLearner protocol. Demonstrates the canonical pipeline: per-detector scores → stacker fit → stacker predict_proba → optional second-stage calibration via fit_platt_binary.

Runtime: <1 s. Pure sklearn + numpy; no optional dependencies. Closes eval-toolkit#52.

Why stack?#

A single binary detector (a fine-tuned classifier, an activation probe, an LLM judge) captures one kind of signal. Stacking learns a regularized combination of multiple detectors’ outputs on a held-out stacking set. When the base detectors carry complementary signal, the resulting meta-classifier usually matches or beats the best base detector. This is an empirical tendency rather than a guarantee, so check it on data the stacker was not fit on.

The toolkit ships one reference stacker, LogisticStacker, which wraps sklearn.linear_model.LogisticRegression and exposes the v1.0 Tier-2 MetaLearner Protocol. Custom stackers (random forests, gradient boosting, calibrated ensembles) can implement the same Protocol structurally.
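To make "wraps sklearn.linear_model.LogisticRegression" concrete, here is a minimal sketch that uses plain scikit-learn rather than the toolkit. It assumes, without confirmation from the toolkit source, that the stacked probability is simply a sigmoid applied to a weighted sum of detector scores. The toy data and the names `scores_toy`, `y_toy`, and `meta` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
y_toy = rng.binomial(1, 0.3, size=n)

# Two illustrative detectors: noisy scores that shift upward for positives.
scores_toy = np.column_stack([
    y_toy * 0.6 + rng.normal(0, 0.3, n),
    y_toy * 0.3 + rng.normal(0, 0.4, n),
])

# Plain scikit-learn stand-in for the stacker (not the toolkit class itself).
meta = LogisticRegression(C=1.0, class_weight="balanced")
meta.fit(scores_toy, y_toy)

# Weighted sum of detector scores plus intercept, passed through a sigmoid.
z = scores_toy @ meta.coef_.ravel() + meta.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-z))
sklearn_proba = meta.predict_proba(scores_toy)[:, 1]

print("manual sigmoid matches predict_proba:", np.allclose(manual_proba, sklearn_proba))
```

Seen this way, each coefficient is the weight a detector's score receives in the combined logit, which is why the fitted weights are directly interpretable.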

Setup#

import numpy as np

from eval_toolkit import (
    LogisticStacker,
    MetaLearner,
    fit_platt_binary,
)
from eval_toolkit.metrics import pr_auc, roc_auc

rng = np.random.default_rng(0)
N = 800

Synthetic three-detector data#

We simulate three binary detectors with descending signal-to-noise on the same N-sample stacking set.

y = rng.binomial(1, 0.3, size=N).astype(int)

# Detector 0: strong signal, low noise
det0 = np.clip(y * 0.75 + rng.normal(0, 0.15, N), 0, 1)
# Detector 1: medium signal, more noise
det1 = np.clip(y * 0.55 + rng.normal(0, 0.25, N), 0, 1)
# Detector 2: weak signal, near-random
det2 = np.clip(y * 0.30 + rng.normal(0, 0.40, N), 0, 1)

score_matrix = np.column_stack([det0, det1, det2])
score_matrix.shape

(800, 3)

Baseline performance per detector:

for i, name in enumerate(["det0", "det1", "det2"]):
    pr = pr_auc(y, score_matrix[:, i])
    roc = roc_auc(y, score_matrix[:, i])
    print(f"{name}: PR-AUC={pr:.3f}, ROC-AUC={roc:.3f}")

det0: PR-AUC=0.999, ROC-AUC=1.000
det1: PR-AUC=0.898, ROC-AUC=0.943
det2: PR-AUC=0.503, ROC-AUC=0.689

Fit the stacker#

LogisticStacker(C=1.0, class_weight='balanced') is a reasonable default for imbalanced binary detection. A smaller C (e.g. 0.1) regularizes more strongly, which helps when the stacking set is small or the detectors are noisy.

stacker = LogisticStacker(C=1.0, class_weight="balanced", rng=0)
stacker.fit(score_matrix, y)
print("coef_:", stacker.coef_)
print("intercept_:", stacker.intercept_)
print("classes_:", stacker.classes_.tolist())

coef_: [8.2625404  3.44106179 1.41988891]
intercept_: [-4.61328624]
classes_: [0, 1]

The fitted weights reflect the signal ordering: det0 (strongest) gets the largest weight, det2 (weakest) gets the smallest.
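The earlier remark that a smaller C regularizes more strongly can be checked directly. The sketch below uses plain scikit-learn LogisticRegression as a stand-in for the stacker (an assumption, based on the statement that LogisticStacker wraps it) and toy data that is not the data from this example. It fits the same score matrix at three values of C and reports the L2 norm of the weights.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 400
y_toy = rng.binomial(1, 0.3, size=n)

# Three illustrative detectors with decreasing signal and increasing noise.
scores_toy = np.column_stack([
    y_toy * 0.5 + rng.normal(0, 0.3, n),
    y_toy * 0.3 + rng.normal(0, 0.4, n),
    y_toy * 0.1 + rng.normal(0, 0.5, n),
])

coef_norms = {}
for C in (0.01, 1.0, 100.0):
    model = LogisticRegression(C=C, class_weight="balanced", max_iter=1000)
    model.fit(scores_toy, y_toy)
    # Smaller C means a stronger L2 penalty, which shrinks the weights toward zero.
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C:>6}: coef={np.round(model.coef_.ravel(), 3)}, L2 norm={coef_norms[C]:.3f}")
```

Shrunken weights pull the stacked logit toward the intercept, so a heavily regularized stacker leans less on any single detector.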

Stacked predictions#

stacked_proba = stacker.predict_proba(score_matrix)[:, 1]
stacked_pr = pr_auc(y, stacked_proba)
stacked_roc = roc_auc(y, stacked_proba)
print(f"Stacked: PR-AUC={stacked_pr:.3f}, ROC-AUC={stacked_roc:.3f}")

Stacked: PR-AUC=1.000, ROC-AUC=1.000

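On this synthetic data the stacker edges out the best base detector on PR-AUC (1.000 vs. 0.999 for det0) and ties it on ROC-AUC at three decimals, while clearly beating det1 and det2. These are in-sample scores, since the stacker is evaluated on the same rows it was fit on; on real data, measure the gain on a held-out split.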
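Because the scores above are computed on the same rows the stacker was fit on, a held-out check is a useful complement. The following is a sketch under stated assumptions: it uses plain scikit-learn (LogisticRegression, roc_auc_score, average_precision_score) in place of the toolkit's stacker and metrics, which may be defined differently, and it uses deliberately weak toy detectors so that stacking has room to help.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n = 2000
y_toy = rng.binomial(1, 0.3, size=n)

# Three weak detectors with independent noise (illustrative data only).
scores_toy = np.column_stack(
    [y_toy * 0.3 + rng.normal(0, 0.4, n) for _ in range(3)]
)

# Fit the stacker on one half and evaluate on the other half.
X_fit, X_test, y_fit, y_test = train_test_split(
    scores_toy, y_toy, test_size=0.5, stratify=y_toy, random_state=0
)
meta = LogisticRegression(C=1.0, class_weight="balanced").fit(X_fit, y_fit)
stacked_test = meta.predict_proba(X_test)[:, 1]

base_aucs = [roc_auc_score(y_test, X_test[:, j]) for j in range(X_test.shape[1])]
stacked_auc = roc_auc_score(y_test, stacked_test)
stacked_ap = average_precision_score(y_test, stacked_test)

for j, auc in enumerate(base_aucs):
    print(f"det{j} held-out ROC-AUC={auc:.3f}")
print(f"stacked held-out ROC-AUC={stacked_auc:.3f}, average precision={stacked_ap:.3f}")
```

With independent noise across detectors, averaging their evidence reduces variance, which is where the held-out gain comes from. Highly correlated detectors would show a smaller improvement.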

Second-stage calibration#

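An unweighted logistic regression is usually close to calibrated on its training distribution, but class_weight='balanced' reweights the classes and shifts predicted probabilities away from the true base rate, and any model can drift on a held-out test set. For downstream calibration metrics (ECE, Brier), chain through fit_platt_binary on a disjoint calibration set. Here we reuse the same data purely for illustration: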

(a, b), apply_platt = fit_platt_binary(y, stacked_proba)
calibrated_proba = apply_platt(stacked_proba)

# Discrimination is unchanged (Platt is a monotone transform):
print(f"Pre-calibration  PR-AUC={pr_auc(y, stacked_proba):.4f}")
print(f"Post-calibration PR-AUC={pr_auc(y, calibrated_proba):.4f}")
print(f"Platt (a, b): ({a:.3f}, {b:.3f})")

Pre-calibration  PR-AUC=1.0000
Post-calibration PR-AUC=1.0000
Platt (a, b): (13.752, -7.466)

For a full 4-calibrator audit (temperature / isotonic / Platt / Beta), see the calibration chapter.
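The calibration step above reuses the stacking data for brevity. As a sketch of the disjoint-set workflow, the block below splits toy data into three parts (stacker fit, calibrator fit, evaluation) and uses a one-dimensional scikit-learn LogisticRegression on the stacker's logit as a stand-in for Platt scaling. This is an assumption for illustration: it is not fit_platt_binary, whose internals are not shown here, and it omits the target smoothing of Platt's original method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, roc_auc_score

rng = np.random.default_rng(3)
n = 3000
y_toy = rng.binomial(1, 0.3, size=n)
scores_toy = np.column_stack(
    [y_toy * 0.3 + rng.normal(0, 0.4, n) for _ in range(3)]
)

# Three disjoint index ranges: stacker fit, calibrator fit, final evaluation.
fit_idx, cal_idx, test_idx = np.split(np.arange(n), [1000, 2000])

meta = LogisticRegression(C=1.0, class_weight="balanced")
meta.fit(scores_toy[fit_idx], y_toy[fit_idx])

# Platt-style stand-in: logistic regression on the stacker's logit.
# A very large C approximates an unregularized fit of slope a and offset b.
logit_cal = meta.decision_function(scores_toy[cal_idx]).reshape(-1, 1)
platt = LogisticRegression(C=1e6, max_iter=1000).fit(logit_cal, y_toy[cal_idx])
a, b = platt.coef_[0, 0], platt.intercept_[0]

y_test = y_toy[test_idx]
logit_test = meta.decision_function(scores_toy[test_idx]).reshape(-1, 1)
raw_test = meta.predict_proba(scores_toy[test_idx])[:, 1]
cal_test = platt.predict_proba(logit_test)[:, 1]

auc_raw = roc_auc_score(y_test, raw_test)
auc_cal = roc_auc_score(y_test, cal_test)
print(f"Platt slope a={a:.3f}, offset b={b:.3f}")
print(f"base rate={y_test.mean():.3f}, mean raw={raw_test.mean():.3f}, "
      f"mean calibrated={cal_test.mean():.3f}")
print(f"Brier raw={brier_score_loss(y_test, raw_test):.4f}, "
      f"calibrated={brier_score_loss(y_test, cal_test):.4f}")
print(f"ROC-AUC raw={auc_raw:.4f}, calibrated={auc_cal:.4f}")
```

Because the fitted slope is positive, the transform is monotone and ranking metrics are unchanged. What moves is the probability scale, which the balanced class weights had shifted away from the base rate.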

Custom `MetaLearner` impls#

The MetaLearner Protocol is structural: any class exposing coef_, classes_, intercept_, fit(score_matrix, y), predict(score_matrix), and predict_proba(score_matrix) satisfies it. Alternatives to LogisticStacker include sklearn's GradientBoostingClassifier, RandomForestClassifier, or domain-specific ensemblers. Tree ensembles do not expose coef_ or intercept_ themselves, so wrap them in a thin Protocol-conformant adapter that supplies those attributes; the rest of the harness then works without changes.

# Verify structural Protocol satisfaction
assert isinstance(stacker, MetaLearner)
print("LogisticStacker satisfies MetaLearner ✓")

LogisticStacker satisfies MetaLearner ✓
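A thin adapter of the kind described above might look like the following sketch. Everything named here is hypothetical: `MetaLearnerLike` is a local stand-in Protocol built from the member list in the text, not the toolkit's MetaLearner, and `ForestStackerAdapter` is not part of eval_toolkit. Exposing feature importances as coef_ and a zero intercept_ is a placeholder choice made for this sketch, since a random forest has no linear weights.

```python
from typing import Protocol, runtime_checkable

import numpy as np
from sklearn.ensemble import RandomForestClassifier


@runtime_checkable
class MetaLearnerLike(Protocol):
    """Hypothetical stand-in listing the members named in the text."""

    coef_: np.ndarray
    intercept_: np.ndarray
    classes_: np.ndarray

    def fit(self, score_matrix, y): ...
    def predict(self, score_matrix): ...
    def predict_proba(self, score_matrix): ...


class ForestStackerAdapter:
    """Illustrative adapter around RandomForestClassifier (not a toolkit class)."""

    def __init__(self, n_estimators=50, random_state=0):
        self._model = RandomForestClassifier(
            n_estimators=n_estimators, random_state=random_state
        )

    def fit(self, score_matrix, y):
        self._model.fit(score_matrix, y)
        self.classes_ = self._model.classes_
        # Placeholder values (assumption of this sketch): importances stand in
        # for linear weights, and the intercept is fixed at zero.
        self.coef_ = self._model.feature_importances_.copy()
        self.intercept_ = np.zeros(1)
        return self

    def predict(self, score_matrix):
        return self._model.predict(score_matrix)

    def predict_proba(self, score_matrix):
        return self._model.predict_proba(score_matrix)


rng = np.random.default_rng(4)
n = 300
y_toy = rng.binomial(1, 0.3, size=n)
scores_toy = np.column_stack(
    [y_toy * 0.4 + rng.normal(0, 0.3, n) for _ in range(3)]
)

adapter = ForestStackerAdapter().fit(scores_toy, y_toy)
print("adapter satisfies MetaLearnerLike:", isinstance(adapter, MetaLearnerLike))
print("predict_proba shape:", adapter.predict_proba(scores_toy).shape)
```

An isinstance check against a runtime-checkable Protocol only verifies that the named members are present, not their signatures or semantics. In this sketch the attributes are set inside fit, so an unfitted adapter would not pass the check.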

References#

Wolpert, D. H. 1992. “Stacked generalization.” Neural Networks 5(2), 241–259. doi:10.1016/S0893-6080(05)80023-1.
Breiman, L. 1996. “Stacked regressions.” Machine Learning 24(1), 49–64.
