# Recipe: flash_attention_2 graceful fallback

**Pattern:** transformer scorers that prefer `flash_attention_2` for
speed must degrade gracefully when the GPU class doesn't support it —
otherwise the same code that runs on H100 / A100 breaks on RTX A6000
or smaller GPUs in the runpod-deploy GPU-failover pool.

## Why this is a recipe, not a schema feature

GPU-class detection and attention-implementation selection are
*consumer-domain* concerns. They depend on the model architecture, the
PyTorch/Transformers version pinned by your project, and what counts as
an acceptable degraded mode for your evaluation. None of that is
deployment metadata.

What `runpod-deploy` owns is the *failover pool*: it picks an available
GPU class from `pod.gpu_order` and provisions it. What attention
implementation your code uses on that GPU is yours to decide. Baking
the try/except into the orchestrator would force one fallback policy
on every consumer; consumers who genuinely need flash-attn-2 (e.g.,
for paper-grade timing comparisons) would have to opt out of the
fallback they didn't ask for.

## Pattern (Python)

```python notest
import torch
from transformers import AutoModel

try:
    encoder = AutoModel.from_pretrained(
        model_id,
        revision=revision,
        attn_implementation="flash_attention_2",
        torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    )
except (ValueError, ImportError):
    # flash-attention-2 not available on this GPU class; fall back to
    # stock SDPA. Keep dtype + revision the same so determinism survives.
    encoder = AutoModel.from_pretrained(
        model_id,
        revision=revision,
        torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    )
```

The try/except costs nothing at runtime when flash-attn-2 *is* supported
(the import / construct succeeds on first try) and turns a hard
ValueError into a logged degraded mode on smaller GPUs.

## What lives where

| Concern | Owner |
|---|---|
| Selecting an available GPU class from `pod.gpu_order` | `runpod-deploy` (failover loop in `provider.select_gpu_across_datacenters`) |
| Detecting GPU-class capabilities (FA2 support, SM compute capability) | Your model-loading code |
| Choosing the attention implementation | Your model-loading code |
| Logging which implementation was actually used (per-shard audit) | Your training code (emit a `events-query`-readable event) |
| Aggregating fallback frequency across shards | Your post-run analysis (`events-query` or custom forensics) |

## Anti-pattern to avoid

**Do not let the model-load fail hard when flash-attn-2 isn't
available.** `pod.gpu_order` typically lists several GPU classes
(failover for stock-outs). A sweep may land on an H100 for one shard
and an A6000 for the next. Without the fallback, the second shard
fails at model load with a `ValueError: flash_attention_2 is not
supported`, the orchestrator pulls a stack trace, and the operator
gets a billed failure for a portable-code bug.

**Do not bake the fallback into `runpod-deploy` itself.** The right
choice between flash-attn-2, SDPA, and eager attention depends on
your model + your tolerance for degraded performance. Consumers
running paper-grade timing comparisons may explicitly *want* the
hard-fail so they catch GPU-class drift early; consumers running
production evals want the graceful fallback. Both are legitimate.

## See also

- [`multi-config-sweep.md`](multi-config-sweep.md) — sweeps that
  span GPU classes (the typical case for `pod.gpu_order` with
  multiple entries) hit this exact failure mode without the fallback.
- [`reproducibility.md`](reproducibility.md) — log which attention
  implementation was used per shard for audit purposes; pair with
  `events.emit_event("attn_impl", ...)` from your training code so
  `events-query` can later answer "which shards fell back?".