Full Attention Strikes Back: RTPurbo Transforms Full-Attention Models into Sparse Ones in Hundreds of Steps

The bottleneck of long-context inference lives almost entirely in the attention mechanism. The KV cache grows linearly with context length and attention computation grows quadratically: run prefill on a million-token input and your GPU fans take off.
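To make the two growth rates concrete, here is a rough back-of-the-envelope sketch. The model configuration (32 layers, 8 KV heads, head dimension 128, 2 bytes per element) is a hypothetical one picked for illustration and does not come from the paper, and the FLOP count ignores causal-mask savings.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to store keys and values for one sequence."""
    # Factor 2: one tensor for keys, one for values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def prefill_attention_flops(seq_len, n_layers, n_heads, head_dim):
    """Approximate FLOPs for Q @ K^T plus weights @ V during prefill."""
    # Each of the two matmuls costs about 2 * seq_len^2 * head_dim per head.
    return 4 * seq_len * seq_len * head_dim * n_heads * n_layers


# Hypothetical configuration, chosen only for illustration.
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)

small = kv_cache_bytes(128_000, **cfg)
large = kv_cache_bytes(1_024_000, **cfg)

print(large / small)      # 8x the tokens -> 8x the cache (linear)
print(large / 2**30)      # cache size in GiB for this made-up config
print(prefill_attention_flops(1_024_000, 32, 32, 128)
      / prefill_attention_flops(128_000, 32, 32, 128))  # 64x (quadratic)
```

Eight times the context costs eight times the memory but sixty-four times the attention compute, which is why prefill is the part that hurts first.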

Existing solutions are either native sparse training (train a sparse attention model from scratch, extremely costly) or heuristic token eviction (throw away some tokens at inference time, accuracy is a gamble). RTPurbo (arXiv:2605.16928, Yanke Zhou et al., May 16, 2026) says: neither is necessary.

Three observations

First, only a small subset of attention heads truly needs full long-context processing. Most heads are simply ineffective at long-range retrieval — their attention patterns are short and local. The ones doing actual retrieval are just a few "retrieval heads."

Second, long-range retrieval is governed primarily by a low-dimensional subspace. The paper shows that a 16-dimensional token indexer is sufficient to retrieve the relevant tokens efficiently. 16 dimensions, not 128, not full dimensionality.
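A minimal sketch of what scoring tokens in a low-dimensional space could look like. This is not the paper's indexer: the projection matrices `w_q` and `w_k` are random stand-ins for whatever the fine-tuning would learn, and the names `project` and `index_scores` are made up for this example.

```python
import random


def project(vec, matrix):
    """Multiply a length-d vector by a d x r matrix given as a list of rows."""
    r = len(matrix[0])
    return [sum(vec[i] * matrix[i][j] for i in range(len(vec))) for j in range(r)]


def index_scores(query, keys, w_q, w_k):
    """Score every cached key against the query in the r-dimensional space."""
    q_low = project(query, w_q)
    # In a real system the projected keys would be computed once and cached.
    return [sum(a * b for a, b in zip(q_low, project(k, w_k))) for k in keys]


random.seed(0)
HEAD_DIM, INDEX_DIM, N_TOKENS = 128, 16, 64  # 128 and 64 are assumptions

# Random placeholders for learned projections (assumption, not the paper's weights).
w_q = [[random.gauss(0, 1) for _ in range(INDEX_DIM)] for _ in range(HEAD_DIM)]
w_k = [[random.gauss(0, 1) for _ in range(INDEX_DIM)] for _ in range(HEAD_DIM)]

keys = [[random.gauss(0, 1) for _ in range(HEAD_DIM)] for _ in range(N_TOKENS)]
query = [random.gauss(0, 1) for _ in range(HEAD_DIM)]

scores = index_scores(query, keys, w_q, w_k)
print(len(scores))  # one relevance score per cached token
```

The point of the sketch is the cost shape: once keys are projected, each score is a 16-element dot product instead of a 128-element one.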

Third, the useful token budget is strongly query-dependent. Some questions need only a few key tokens, others require scanning a large segment. So dynamic top-p selection is more appropriate than fixed top-k sparsification — let the model decide how much to look at.
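The difference between the two selection rules is easy to see in a toy example. The sketch below is a generic top-p (cumulative softmax mass) selector next to a fixed top-k one; the threshold `p=0.9`, the budget `k=8`, and the function names are illustrative assumptions, not values or APIs from the paper.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_top_p(scores, p=0.9):
    """Smallest set of token indices whose softmax mass reaches p."""
    probs = softmax(scores)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, mass = [], 0.0
    for i in order:
        chosen.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return sorted(chosen)


def select_top_k(scores, k):
    """Always keep exactly k token indices (or all of them if fewer exist)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


peaked = [10.0] + [0.0] * 99  # one token dominates
flat = [0.0] * 100            # every token equally relevant

print(len(select_top_p(peaked)), len(select_top_k(peaked, 8)))
print(len(select_top_p(flat)), len(select_top_k(flat, 8)))
```

On the peaked query, top-p keeps a single token while top-k still pays for eight. On the flat query, top-p widens to roughly ninety tokens while top-k silently drops most of the relevant mass.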

What RTPurbo does

Based on these observations, RTPurbo's approach: retain the full KV cache only for retrieval heads, and use a lightweight token indexer to drive sparse attention on all other heads.
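As a rough illustration of that split, here is a single-query sketch in which retrieval heads attend over every cached position and the remaining heads attend only over positions an indexer picked. The function names, the `retrieval_heads` set, and the `indexer_indices` argument are hypothetical; how the paper actually wires this into a model is not shown here.

```python
import math


def attend(query, keys, values, indices):
    """Softmax attention for one head, restricted to the given key positions."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, keys[i])) for i in indices]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * values[i][d] for w, i in zip(weights, indices)) / total
        for d in range(dim)
    ]


def hybrid_attention(head_id, query, keys, values, retrieval_heads, indexer_indices):
    """Retrieval heads see all tokens; other heads see only the indexer's picks."""
    if head_id in retrieval_heads:
        indices = list(range(len(keys)))
    else:
        indices = list(indexer_indices)
    return attend(query, keys, values, indices)


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0], [2.0], [3.0]]
query = [1.0, 0.0]
retrieval_heads = {0}  # assumption: head 0 was identified as a retrieval head

full = hybrid_attention(0, query, keys, values, retrieval_heads, [2])
sparse = hybrid_attention(1, query, keys, values, retrieval_heads, [2])
print(full, sparse)
```

Head 0 mixes all three values, while head 1 only ever touches the one position it was handed, so its cost scales with the selected set rather than the context length.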

The key breakthrough: this transformation requires only hundreds of training steps. No sparse pretraining from scratch, no data pipeline rebuild. Take a pre-trained full-attention model, fine-tune for a few hundred steps, done.

Results

Near-lossless accuracy on long-context benchmarks and reasoning tasks. 9.36x prefill speedup at 1M context, 2.01x decode speedup.

9x means: if million-token prefill used to take 30 seconds, it now takes about 3. For applications processing long documents, codebases, or conversations, this isn't an "optimization" — it's a leap from "doesn't work" to "works."
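The arithmetic behind that claim, spelled out. The 30-second baseline is the hypothetical from the paragraph above, not a measured number; only the two speedup factors come from the reported results.

```python
def time_after_speedup(baseline_seconds, speedup):
    """Wall-clock time once a given speedup factor is applied."""
    return baseline_seconds / speedup


PREFILL_SPEEDUP = 9.36  # reported at 1M context
DECODE_SPEEDUP = 2.01   # reported decode speedup

# Hypothetical 30 s baseline for both phases, for illustration only.
prefill_after = time_after_speedup(30.0, PREFILL_SPEEDUP)
decode_after = time_after_speedup(30.0, DECODE_SPEEDUP)

print(round(prefill_after, 1))  # roughly 3.2 seconds
print(round(decode_after, 1))   # roughly 14.9 seconds
```

The decode gain is much smaller than the prefill gain, so the headline improvement is mostly about time to first token on very long inputs.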

Why it matters

For years, sparse attention research went down a heavy path: designing complex sparsity patterns, training from scratch, making trade-offs. RTPurbo says: the model is already sparse enough, you just need to "reveal" it.

This echoes a similar trajectory in pruning research — early work thought pruning required retraining or complex sparse constraints, then it turned out most models have massive redundancy and simple post-training pruning works. Sparse attention may be following the same path.

Sources:

arXiv:2605.16928, Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps, Yanke Zhou et al., 2026-05-16
