---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
kernelspec:
  display_name: Python 3
  language: python
  name: python3
---

# Worked example: detector stacking with `LogisticStacker`

> **What this shows.** Combine three synthetic per-sample detectors into a
> calibrated ensemble using `LogisticStacker`, the v0.45.0 reference implementation of
> the `MetaLearner` protocol. Demonstrates the canonical pipeline: per-detector
> scores → stacker fit → stacker `predict_proba` → optional second-stage
> calibration via `fit_platt_binary`.
>
> **Runtime:** <1 s. Pure sklearn + numpy; no optional dependencies.
> Closes [eval-toolkit#52](https://github.com/brandon-behring/eval-toolkit/issues/52).

## Why stack?

A single binary detector (a fine-tuned classifier, an activation probe,
or an LLM judge) captures one kind of signal. Stacking learns a
regularized combination of several detectors' outputs on a held-out
*stacking* set. The resulting meta-classifier typically matches or beats
the best base detector, although that is an empirical tendency rather than
a guarantee, so confirm it on data the stacker was not fit on.

The toolkit ships one reference stacker, `LogisticStacker`, which wraps
`sklearn.linear_model.LogisticRegression` and exposes the v1.0 Tier-2
`MetaLearner` Protocol. Custom stackers (random forests, gradient boosting,
calibrated ensembles) can implement the same Protocol structurally.
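
For intuition, a logistic stacker maps each row of detector scores `s` to
`sigmoid(w · s + b)`. The sketch below evaluates that formula in plain NumPy.
The helper name `stack_scores` and the weights are made up for illustration
and are not part of the toolkit.

```python
import numpy as np


def stack_scores(score_matrix, weights, intercept):
    """Hypothetical helper: logistic combination of per-detector scores."""
    logits = score_matrix @ weights + intercept
    return 1.0 / (1.0 + np.exp(-logits))


# Two samples scored by three detectors (illustrative values only).
scores = np.array([[0.9, 0.6, 0.4],
                   [0.1, 0.2, 0.5]])
weights = np.array([4.0, 2.0, 0.5])  # assumed weights, not fitted ones
proba = stack_scores(scores, weights, intercept=-3.0)
print(proba)
```

Fitting the stacker amounts to choosing `weights` and `intercept` from labeled data instead of by hand.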

## Setup

```{code-cell}
import numpy as np

from eval_toolkit import (
    LogisticStacker,
    MetaLearner,
    fit_platt_binary,
)
from eval_toolkit.metrics import pr_auc, roc_auc

rng = np.random.default_rng(0)
N = 800
```

## Synthetic three-detector data

We simulate three binary detectors with descending signal-to-noise on the
same `N`-sample stacking set.

```{code-cell}
y = rng.binomial(1, 0.3, size=N).astype(int)

# Detector 0: strong signal, low noise
det0 = np.clip(y * 0.75 + rng.normal(0, 0.15, N), 0, 1)
# Detector 1: medium signal, more noise
det1 = np.clip(y * 0.55 + rng.normal(0, 0.25, N), 0, 1)
# Detector 2: weak signal, near-random
det2 = np.clip(y * 0.30 + rng.normal(0, 0.40, N), 0, 1)

score_matrix = np.column_stack([det0, det1, det2])
score_matrix.shape
```

Baseline performance per detector:

```{code-cell}
for i, name in enumerate(["det0", "det1", "det2"]):
    pr = pr_auc(y, score_matrix[:, i])
    roc = roc_auc(y, score_matrix[:, i])
    print(f"{name}: PR-AUC={pr:.3f}, ROC-AUC={roc:.3f}")
```
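
As a reminder of what the ROC-AUC column measures: it is the probability that
a randomly chosen positive scores higher than a randomly chosen negative, with
ties counted as one half. The helper below, `pairwise_roc_auc`, is a
hypothetical illustration of that definition; the toolkit's `roc_auc` may be
implemented differently.

```python
import numpy as np


def pairwise_roc_auc(y_true, scores):
    """Hypothetical helper: ROC-AUC as the fraction of correctly ordered pairs."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive against every negative; ties count as half a win.
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))


auc_demo = pairwise_roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc_demo)  # 3 of the 4 positive/negative pairs are ordered correctly: 0.75
```

The pairwise form is quadratic in the sample count, so it suits small sanity checks rather than large evaluations.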

## Fit the stacker

`LogisticStacker(C=1.0, class_weight='balanced')` is a reasonable default
for imbalanced binary detection. Smaller `C` (e.g. `0.1`) regularizes more
strongly, useful when the stacking set is small or detectors are noisy.

```{code-cell}
stacker = LogisticStacker(C=1.0, class_weight="balanced", rng=0)
stacker.fit(score_matrix, y)
print("coef_:", stacker.coef_)
print("intercept_:", stacker.intercept_)
print("classes_:", stacker.classes_.tolist())
```

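The fitted weights reflect the signal ordering: `det0` (strongest) gets the
largest weight, `det2` (weakest) gets the smallest. The coefficients are
directly comparable here only because all three detectors share the same
`[0, 1]` scale.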
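
To illustrate the effect of `C` mentioned above, the sketch below fits
sklearn's `LogisticRegression` directly (as a stand-in for `LogisticStacker`,
which wraps it) at three regularization strengths and prints the coefficient
norm. The `_demo` data is a fresh synthetic set, not the notebook's
`score_matrix`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng_demo = np.random.default_rng(1)
y_demo = rng_demo.binomial(1, 0.3, size=400)
X_demo = np.column_stack([
    np.clip(y_demo * 0.75 + rng_demo.normal(0, 0.15, 400), 0, 1),  # strong detector
    np.clip(y_demo * 0.30 + rng_demo.normal(0, 0.40, 400), 0, 1),  # weak detector
])

coef_norms = {}
for C in (0.01, 1.0, 100.0):
    model = LogisticRegression(C=C, class_weight="balanced").fit(X_demo, y_demo)
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C:>6}: ||coef||={coef_norms[C]:.3f}")
```

Smaller `C` pulls the coefficients toward zero, which is the behavior to reach for when the stacking set is small.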

## Stacked predictions

```{code-cell}
stacked_proba = stacker.predict_proba(score_matrix)[:, 1]
stacked_pr = pr_auc(y, stacked_proba)
stacked_roc = roc_auc(y, stacked_proba)
print(f"Stacked: PR-AUC={stacked_pr:.3f}, ROC-AUC={stacked_roc:.3f}")
```

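On this synthetic data the stacker typically scores at least as well as every
base detector. These numbers are in-sample, since the stacker is scored on
the rows it was fit on, so treat them as optimistic and confirm the gain on
held-out data.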
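
A held-out check can be sketched as follows: split the stacking set in two,
fit on one half, and score on the other. The example uses sklearn's
`LogisticRegression` and `roc_auc_score` as stand-ins for `LogisticStacker`
and the toolkit's `roc_auc`, and regenerates its own synthetic data so it runs
on its own.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng_demo = np.random.default_rng(0)
n = 800
y_demo = rng_demo.binomial(1, 0.3, size=n)
# Three detectors with decreasing signal and increasing noise.
X_demo = np.column_stack([
    np.clip(y_demo * signal + rng_demo.normal(0, noise, n), 0, 1)
    for signal, noise in [(0.75, 0.15), (0.55, 0.25), (0.30, 0.40)]
])

# Stratify so both halves keep roughly the same positive rate.
X_fit, X_eval, y_fit, y_eval = train_test_split(
    X_demo, y_demo, test_size=0.5, stratify=y_demo, random_state=0
)
model = LogisticRegression(C=1.0, class_weight="balanced").fit(X_fit, y_fit)

auc_in = roc_auc_score(y_fit, model.predict_proba(X_fit)[:, 1])
auc_out = roc_auc_score(y_eval, model.predict_proba(X_eval)[:, 1])
print(f"in-sample ROC-AUC={auc_in:.3f}, held-out ROC-AUC={auc_out:.3f}")
```

With only three input features the gap between the two numbers is usually small, but the held-out figure is the one to report.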

## Second-stage calibration

Logistic regression's sigmoid output tends to be well calibrated on the
*training* distribution, but it can drift on a held-out test set, and
`class_weight="balanced"` reweights the classes so the raw probabilities no
longer track the true base rate. For downstream calibration metrics (ECE,
Brier), chain through `fit_platt_binary` on a disjoint calibration set. Here
we reuse the stacking data purely for illustration:

```{code-cell}
(a, b), apply_platt = fit_platt_binary(y, stacked_proba)
calibrated_proba = apply_platt(stacked_proba)

# Ranking metrics are unchanged as long as the fitted Platt map is increasing:
print(f"Pre-calibration  PR-AUC={pr_auc(y, stacked_proba):.4f}")
print(f"Post-calibration PR-AUC={pr_auc(y, calibrated_proba):.4f}")
print(f"Platt (a, b): ({a:.3f}, {b:.3f})")
```
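
For readers who want to see what a Platt-style fit does, here is a minimal
NumPy sketch. It assumes the calibrator fits `sigmoid(a * logit(p) + b)` by
maximum likelihood; `fit_platt_sketch` is a hypothetical stand-in, and the
parameterization and sign conventions of `fit_platt_binary` may differ.

```python
import numpy as np


def _logit(p, eps=1e-6):
    p = np.clip(p, eps, 1.0 - eps)
    return np.log(p / (1.0 - p))


def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_platt_sketch(y, proba, n_iter=25):
    """Hypothetical stand-in: fit sigmoid(a * logit(p) + b) by Newton's method."""
    y = np.asarray(y, dtype=float)
    X = np.column_stack([_logit(proba), np.ones(len(y))])  # slope, intercept
    theta = np.zeros(2)
    for _ in range(n_iter):
        q = _sigmoid(X @ theta)
        grad = X.T @ (q - y)  # gradient of the summed log loss
        hess = (X * (q * (1.0 - q))[:, None]).T @ X + 1e-8 * np.eye(2)
        theta -= np.linalg.solve(hess, grad)
    a, b = theta
    return (a, b), lambda p: _sigmoid(a * _logit(p) + b)


def log_loss(y, p, eps=1e-12):
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


# Synthetic overconfident scores: the true logit is doubled before the sigmoid.
rng_demo = np.random.default_rng(0)
true_p = rng_demo.uniform(0.05, 0.95, size=5000)
y_demo = rng_demo.binomial(1, true_p)
overconfident = _sigmoid(2.0 * _logit(true_p))

(a, b), apply_sketch = fit_platt_sketch(y_demo, overconfident)
calibrated = apply_sketch(overconfident)
print(f"a={a:.2f}, b={b:.2f}")  # a should land near 0.5, undoing the doubling
print(f"log loss before={log_loss(y_demo, overconfident):.4f}, "
      f"after={log_loss(y_demo, calibrated):.4f}")
```

Because the fitted slope is positive, the map preserves the ordering of the scores, which is why ranking metrics such as PR-AUC do not move.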

For a full 4-calibrator audit (temperature / isotonic / Platt / Beta), see
the [calibration chapter](../methodology/calibration.md).

## Custom `MetaLearner` impls

The `MetaLearner` Protocol is structural: any class exposing `coef_`,
`classes_`, `intercept_`, `fit(score_matrix, y)`, `predict(score_matrix)`,
and `predict_proba(score_matrix)` satisfies it. Alternatives to
`LogisticStacker` include sklearn's `GradientBoostingClassifier`,
`RandomForestClassifier`, or domain-specific ensemblers. Tree-based models
have no native `coef_` or `intercept_`, so wrap them in a thin
Protocol-conformant adapter that supplies those attributes; the rest of the
harness then works without changes.

```{code-cell}
# Verify structural Protocol satisfaction
assert isinstance(stacker, MetaLearner)
print("LogisticStacker satisfies MetaLearner ✓")
```
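
To make the structural idea concrete, the sketch below defines a
runtime-checkable Protocol with the members listed above and a toy stacker
that satisfies it without inheriting from anything. `MetaLearnerLike` is a
hypothetical mirror of that member list, not the toolkit's actual definition,
and `MeanStacker` is an illustrative toy rather than a recommended model.

```python
from typing import Protocol, runtime_checkable

import numpy as np


@runtime_checkable
class MetaLearnerLike(Protocol):
    """Hypothetical mirror of the members listed above."""

    coef_: np.ndarray
    intercept_: np.ndarray
    classes_: np.ndarray

    def fit(self, score_matrix, y): ...
    def predict(self, score_matrix): ...
    def predict_proba(self, score_matrix): ...


class MeanStacker:
    """Toy stacker: averages detector scores with equal weights."""

    def fit(self, score_matrix, y):
        n_detectors = np.asarray(score_matrix).shape[1]
        self.coef_ = np.full((1, n_detectors), 1.0 / n_detectors)
        self.intercept_ = np.zeros(1)
        self.classes_ = np.unique(y)
        return self

    def predict_proba(self, score_matrix):
        p1 = np.asarray(score_matrix) @ self.coef_.ravel() + self.intercept_[0]
        p1 = np.clip(p1, 0.0, 1.0)
        return np.column_stack([1.0 - p1, p1])

    def predict(self, score_matrix):
        above = self.predict_proba(score_matrix)[:, 1] >= 0.5
        return self.classes_[above.astype(int)]


toy = MeanStacker()
unfitted_ok = isinstance(toy, MetaLearnerLike)  # fitted attributes are missing
toy.fit(np.array([[0.9, 0.8], [0.1, 0.2]]), np.array([1, 0]))
fitted_ok = isinstance(toy, MetaLearnerLike)    # all members now present
print(unfitted_ok, fitted_ok)
```

A runtime `isinstance` check against a Protocol only confirms that the members exist, not that their signatures or return shapes are right, so an adapter still deserves its own tests.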

## References

- Wolpert, D. H. 1992. "Stacked generalization." *Neural Networks* 5(2),
  241–259. [doi:10.1016/S0893-6080(05)80023-1](https://doi.org/10.1016/S0893-6080(05)80023-1).
- Breiman, L. 1996. "Stacked regressions." *Machine Learning* 24(1),
  49–64.