PyTorch · Flash Attention · CUDA · LLM Training

Implementing Flash Attention in PyTorch

A deep dive into how IO-Aware exact attention algorithms reduce high-bandwidth memory (HBM) reads/writes and prevent out-of-memory errors on massive sequences.

# Implementing Flash Attention in PyTorch

Standard scaled dot-product attention has $O(N^2)$ time and memory complexity. With context lengths beyond 4096 tokens, allocating the $N \times N$ attention matrix in GPU high-bandwidth memory (HBM) often triggers out-of-memory (OOM) errors. Flash Attention is an exact attention algorithm that reorders the computation and uses tiling to reduce memory reads and writes between HBM and on-chip SRAM.

## The Problem with Standard Attention

In standard PyTorch, attention is implemented as:

```python
import math

import torch
import torch.nn.functional as F

def standard_attention(Q, K, V):
    # Q, K, V shape: (batch_size, num_heads, seq_len, head_dim)
    d_k = Q.size(-1)
    # 1. HBM read Q, K -> SRAM compute -> HBM write S
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    # 2. HBM read S -> SRAM compute -> HBM write P
    p = F.softmax(scores, dim=-1)
    # 3. HBM read P, V -> SRAM compute -> HBM write Out
    out = torch.matmul(p, V)
    return out
```

Every intermediate matrix (`scores`, `p`) is materialized in and written back to HBM.

## The Flash Attention Approach

Flash Attention fuses these operations. It streams blocks of `Q`, `K`, and `V` from HBM into SRAM, computes attention incrementally, and maintains running softmax statistics (the row maximum and the denominator) so the full $N \times N$ score matrix never exists at once.

You don't need to write custom CUDA kernels to benefit from this if you use PyTorch 2.0+. It's natively supported via `torch.nn.functional.scaled_dot_product_attention`:

```python
with torch.backends.cuda.sdp_kernel(enable_flash=True,
                                    enable_math=False,
                                    enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(Q, K, V)
```

By forcing `enable_flash=True`, PyTorch bypasses the allocation of the $N \times N$ intermediate matrices entirely, scaling efficiently to 100k+ tokens depending on the hardware.
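To make the tiling idea concrete, here is a minimal pure-PyTorch sketch of the online-softmax loop at the heart of Flash Attention. The function name and block size are illustrative; this runs each block through regular PyTorch ops (so every tensor still round-trips through HBM) and only demonstrates the math, not the fused-kernel speedup:

```python
import torch

def flash_attention_sketch(Q, K, V, block_size=64):
    """Educational sketch: process K/V in blocks, keeping a running
    row-max and softmax denominator so the full N x N score matrix
    is never materialized. Real Flash Attention fuses this loop into
    a single CUDA kernel operating in SRAM."""
    scale = Q.size(-1) ** -0.5
    B, H, N, _ = Q.shape
    out = torch.zeros_like(Q)
    # Running softmax statistics, one per query row
    row_max = torch.full((B, H, N, 1), float("-inf"),
                         device=Q.device, dtype=Q.dtype)
    row_sum = torch.zeros(B, H, N, 1, device=Q.device, dtype=Q.dtype)
    for start in range(0, N, block_size):
        Kb = K[:, :, start:start + block_size]
        Vb = V[:, :, start:start + block_size]
        # Scores for this block only: (B, H, N, block_size)
        S = torch.matmul(Q, Kb.transpose(-2, -1)) * scale
        new_max = torch.maximum(row_max, S.max(dim=-1, keepdim=True).values)
        # Rescale previously accumulated sums to the new running max
        correction = torch.exp(row_max - new_max)
        P = torch.exp(S - new_max)
        row_sum = row_sum * correction + P.sum(dim=-1, keepdim=True)
        out = out * correction + torch.matmul(P, Vb)
        row_max = new_max
    # Normalize by the accumulated softmax denominator
    return out / row_sum
```

Because the running max is folded in via the `correction` factor, each block's contribution is numerically identical to what a full softmax over all $N$ keys would produce, so the output matches standard attention up to floating-point error.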
## Why This Matters for Production AI Engineering

If you are building your own LLM from scratch (as we teach in the skilling academy's *Build Your Own LLM* cohort), understanding memory-bandwidth bottlenecks is more important than counting FLOPs. Modern GPUs spend more time moving data between HBM and compute units than doing the arithmetic itself. Flash Attention attacks exactly that memory-I/O bottleneck.