The bottleneck of long-context inference sits almost entirely in the attention mechanism. The KV cache grows linearly with context length and attention computation grows quadratically: run prefill on a million-token input and your GPU fans take off.
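To make that scaling concrete, here is a rough back-of-the-envelope sketch. The layer count, head counts, head dimension and 2-byte precision are hypothetical defaults, not numbers from the paper, and the FLOP estimate ignores causal masking and everything outside attention.

```python
def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Memory for keys and values across all layers; linear in n_tokens.

    All defaults are hypothetical model dimensions chosen for illustration.
    """
    # Factor 2 covers one key vector and one value vector per token.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens


def attention_flops(n_tokens, n_layers=32, n_heads=32, head_dim=128):
    """Approximate prefill FLOPs for full attention; quadratic in n_tokens.

    Counts the Q*K^T product and the weights*V product, each roughly
    2 * n_tokens^2 * head_dim FLOPs per head.
    """
    return 2 * 2 * n_layers * n_heads * head_dim * n_tokens ** 2


for n in (1_000, 10_000, 100_000, 1_000_000):
    gib = kv_cache_bytes(n) / 2**30
    print(f"{n:>9} tokens: KV cache {gib:10.2f} GiB, attention {attention_flops(n):.3e} FLOPs")
```

Going from 1K to 1M tokens multiplies the cache by a thousand and the attention work by a million, which is why prefill is where long contexts hurt most.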
Existing solutions are either native sparse training (train a sparse attention model from scratch, extremely costly) or heuristic token eviction (throw away some tokens at inference time, accuracy is a gamble). RTPurbo (arXiv:2605.16928, Yanke Zhou et al., May 16, 2026) says: neither is necessary.
Three observations
First, only a small subset of attention heads truly needs full long-context processing. Most heads are simply ineffective at long-range retrieval — their attention patterns are short and local. The ones doing actual retrieval are just a few "retrieval heads."
Second, long-range retrieval is governed primarily by a low-dimensional subspace. The paper shows that a 16-dimensional token indexer is sufficient to retrieve the relevant tokens efficiently. That is 16 dimensions, not 128 and not the full dimensionality.
Third, the useful token budget is strongly query-dependent. Some questions need only a few key tokens, others require scanning a large segment. So dynamic top-p selection is more appropriate than fixed top-k sparsification — let the model decide how much to look at.
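The second and third observations can be sketched together: score cached tokens in a 16-dimensional space, then keep however many tokens are needed to cover a probability mass p. This is an illustrative sketch, not the paper's implementation. The random projections stand in for a trained indexer, and HEAD_DIM, the 0.9 threshold and all function names are assumptions.

```python
import math
import random

HEAD_DIM = 128   # hypothetical full head dimension
INDEX_DIM = 16   # indexer width quoted in the article


def project(vec, proj):
    """Map a HEAD_DIM vector to INDEX_DIM using a projection matrix."""
    return [sum(w * x for w, x in zip(row, vec)) for row in proj]


def index_scores(query, keys, q_proj, k_proj):
    """Cheap relevance score per cached token, computed in the 16-dim space.

    A real system would cache the projected keys instead of recomputing them.
    """
    q_low = project(query, q_proj)
    return [sum(a * b for a, b in zip(q_low, project(k, k_proj))) for k in keys]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_select(scores, p=0.9):
    """Smallest set of positions whose softmax mass reaches p."""
    probs = softmax(scores)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    selected, mass = [], 0.0
    for i in order:
        selected.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return sorted(selected)


def top_k_select(scores, k):
    """Fixed budget: always the k highest-scoring positions."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


q_proj = rand_matrix(INDEX_DIM, HEAD_DIM)  # stand-in for learned weights
k_proj = rand_matrix(INDEX_DIM, HEAD_DIM)
keys = rand_matrix(64, HEAD_DIM)
query = rand_matrix(1, HEAD_DIM)[0]

scores = index_scores(query, keys, q_proj, k_proj)
kept = top_p_select(scores, p=0.9)
print(f"top-p kept {len(kept)} of {len(keys)} tokens")
```

The contrast with top-k shows up on extreme inputs: a sharply peaked score distribution lets top-p keep a single token, while a flat one forces it to keep nearly everything. A fixed k spends the same budget in both cases.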
What RTPurbo does
Based on these observations, RTPurbo retains the full KV cache only for retrieval heads and uses a lightweight token indexer to run sparse attention on all other heads.
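A minimal sketch of that split, assuming retrieval heads attend over every cached token and all other heads attend only over positions chosen by an indexer. The `select_positions` callback and the `is_retrieval_head` flag are hypothetical names for illustration. How the paper identifies retrieval heads and what it stores for the remaining heads is not covered here.

```python
import math


def attend(query, keys, values, positions):
    """Softmax attention for one query, restricted to the given positions."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, keys[p])) for p in positions]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * values[p][d] for w, p in zip(weights, positions)) / total
        for d in range(dim)
    ]


def head_output(query, keys, values, is_retrieval_head, select_positions):
    """Dispatch one head: full attention for retrieval heads, sparse otherwise."""
    if is_retrieval_head:
        positions = list(range(len(keys)))           # every cached token
    else:
        positions = select_positions(query, keys)    # indexer-chosen subset
    return attend(query, keys, values, positions)


# Tiny usage example with two cached tokens.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
query = [0.0, 0.0]

full = head_output(query, keys, values, True, None)
sparse = head_output(query, keys, values, False, lambda q, k: [1])
print(full, sparse)
```

The sparse path does work proportional to the number of selected positions rather than the full context length, which is where the savings would come from.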
The key breakthrough: this transformation requires only hundreds of training steps. No sparse pretraining from scratch, no data pipeline rebuild. Take a pre-trained full-attention model, fine-tune for a few hundred steps, done.
Results
The paper reports near-lossless accuracy on long-context benchmarks and reasoning tasks, with a 9.36x prefill speedup at 1M context and a 2.01x decode speedup.
9x means: if million-token prefill used to take 30 seconds, it now takes about 3. For applications processing long documents, codebases, or conversations, this isn't an "optimization" — it's a leap from "doesn't work" to "works."
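The arithmetic behind that estimate, as a quick sketch. The speedup factors come from the paper as quoted above. The 30-second baseline is the hypothetical one from this article, and applying it to decode as well is only for illustration.

```python
PREFILL_SPEEDUP = 9.36  # reported at 1M context
DECODE_SPEEDUP = 2.01   # reported decode speedup


def time_after_speedup(baseline_seconds, speedup):
    """Wall-clock time once a given speedup factor is applied."""
    return baseline_seconds / speedup


baseline = 30.0  # hypothetical baseline, in seconds
print(f"prefill: {time_after_speedup(baseline, PREFILL_SPEEDUP):.2f} s")
print(f"decode:  {time_after_speedup(baseline, DECODE_SPEEDUP):.2f} s")
```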
Why it matters
For years, sparse attention research went down a heavy path: designing complex sparsity patterns, training from scratch, making trade-offs. RTPurbo says: the model is already sparse enough, you just need to "reveal" it.
This echoes a similar trajectory in pruning research: early work assumed pruning required retraining or complex sparsity constraints, then it turned out that most models carry massive redundancy and simple post-training pruning works. Sparse attention may be following the same path.
Sources:
- arXiv:2605.16928, Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps, Yanke Zhou et al., 2026-05-16