Want a stronger model? The obvious move is to add parameters: more layers, wider hidden dimensions, more attention heads. But that gets expensive fast.
The Looped Transformer takes a different path: it reuses the same Transformer blocks iteratively. Loop more at inference, get better results. No extra parameters, no longer context, just more compute time.
Sounds great. But there's a catch: as the loop count goes up, training collapses.
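To make the weight reuse concrete, here is a minimal sketch in plain Python. It is not the paper's architecture: `make_block`, `apply_block`, and `looped_forward` are hypothetical names, and each "block" is a toy residual layer (x + tanh(Wx)) standing in for a full Transformer block. The point is only that the loop count changes the compute, not the parameter count.

```python
import math
import random


def make_block(dim, seed=0):
    """Create the weight matrix of one toy residual block (stand-in for a Transformer block)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(dim)
    return [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]


def apply_block(weights, x):
    """Compute x + tanh(W x), a residual update using the block's weights."""
    return [
        xi + math.tanh(sum(w * xj for w, xj in zip(row, x)))
        for xi, row in zip(x, weights)
    ]


def looped_forward(blocks, x, n_loops):
    """Run the same stack of blocks n_loops times; no new weights are created."""
    for _ in range(n_loops):
        for weights in blocks:
            x = apply_block(weights, x)
    return x


def count_parameters(blocks):
    """Total number of weights, independent of how many times the stack is looped."""
    return sum(len(row) for weights in blocks for row in weights)


blocks = [make_block(dim=4, seed=s) for s in range(2)]
x0 = [0.1, -0.2, 0.3, 0.0]
shallow = looped_forward(blocks, x0, n_loops=1)  # one pass through the stack
deep = looped_forward(blocks, x0, n_loops=8)     # eight passes, same weights
```

Both calls use the same 32 weights; the second simply spends eight times the compute on them.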
Why Looped Transformers break
The paper diagnosed two culprits:
Gradient oscillation. After each loop, gradient direction swings wildly—like a car steering left and right erratically.
Residual explosion. Residual signals accumulate across loops and eventually overflow.
Previous looped models hit a ceiling where training simply couldn't continue—outputs became NaN, everything collapsed.
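Both failure modes can be illustrated with toy diagnostics. The sketch below is an assumption-laden illustration, not the paper's analysis: it treats oscillation as a negative cosine similarity between gradients from consecutive loops, and it models residual growth as a constant multiplicative factor per loop, checked against the float16 maximum. The function names, the constant growth factor, and the choice of half precision are all hypothetical.

```python
import math

FLOAT16_MAX = 65504.0  # largest finite value in IEEE 754 half precision


def cosine_similarity(a, b):
    """Cosine of the angle between two gradient vectors; near -1 means the direction flipped."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def loops_until_overflow(growth_per_loop, start_norm=1.0,
                         limit=FLOAT16_MAX, max_loops=1000):
    """Count loops until a residual norm that is multiplied by growth_per_loop
    on every loop (an assumed toy model) exceeds limit. Returns None if it never does."""
    norm = start_norm
    for loop in range(1, max_loops + 1):
        norm *= growth_per_loop
        if norm > limit:
            return loop
    return None


# A gradient that reverses direction between two loops.
flip = cosine_similarity([1.0, 0.0], [-1.0, 0.0])

# A residual norm that doubles on every loop.
doubling = loops_until_overflow(2.0)
```

Under these assumptions a residual norm that doubles each loop exceeds the float16 range within 16 loops, which is one way a training run can end up producing NaN.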
The fix
Two changes, zero extra parameters:
Fully looped architecture. Distribute inter-loop signals across all layers instead of adding them at a fixed position. This mitigates residual explosion—signals spread evenly through the network.
Attention injection. Repurpose the existing attention module to suppress gradient oscillation. Not a new module, just a different use of the existing one.
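The contrast between a fixed injection point and a distributed one can be sketched as follows. This is one possible reading of the description above, not the paper's implementation: the "inter-loop signal" is assumed to be the previous loop's output, layers are plain Python callables on scalars, and the even split across layers is an illustrative choice. Attention injection is not modeled here.

```python
def run_fixed_injection(layers, x, n_loops, carry=0.0):
    """Assumed baseline: the whole inter-loop signal enters once, at the first layer's input."""
    for _ in range(n_loops):
        h = x + carry
        for layer in layers:
            h = layer(h)
        carry = h  # the loop's output becomes the next loop's signal
    return carry


def run_distributed_injection(layers, x, n_loops, carry=0.0):
    """Assumed alternative: the same signal is split evenly over every layer's input."""
    share = 1.0 / len(layers)
    for _ in range(n_loops):
        h = x
        for layer in layers:
            h = layer(h + share * carry)
        carry = h
    return carry


def identity(h):
    """Trivial layer, used so the two schemes can be compared by hand."""
    return h


def halve(h):
    """Toy contracting layer: signal injected before it is damped."""
    return 0.5 * h


fixed_out = run_fixed_injection([identity, identity], x=1.0, n_loops=3)
spread_out = run_distributed_injection([identity, identity], x=1.0, n_loops=3)
```

With identity layers the two schemes agree, since the same total signal is injected either way. They differ once layers transform their input, because each share then passes through a different number of layers. The sketch only shows where the signal enters; it makes no claim about why training becomes stable.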
Result? Previous models collapsed at 12 loops, while the Fully Looped Transformer trained stably. Even in milder settings where baselines don't collapse, it improved downstream task performance by up to 13.2%.
Inference compute as a dial
The most interesting implication isn't the performance numbers—it's what this enables.
Since loop count is adjustable at inference, you can use one model and dynamically decide how much compute to spend per query.
Simple queries: 1-2 loops. Complex reasoning: 8-12 loops. There's no need to deploy multiple models of different sizes; you just adjust one hyperparameter.
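A per-query dial might look like the sketch below. It assumes a difficulty score in [0, 1] is already available from some upstream estimator, which is hypothetical and not something the paper describes; `choose_loop_count` and the linear mapping are illustrative choices, with the 1 and 12 bounds taken from the ranges above.

```python
def choose_loop_count(difficulty, min_loops=1, max_loops=12):
    """Map a difficulty score in [0, 1] to a loop count between min_loops and max_loops.

    Scores outside [0, 1] are clamped. The linear mapping is an assumption;
    any monotonic schedule would serve the same purpose.
    """
    clamped = min(1.0, max(0.0, difficulty))
    return min_loops + int(clamped * (max_loops - min_loops))


easy_loops = choose_loop_count(0.0)  # simple lookup-style query
mid_loops = choose_loop_count(0.5)   # moderately hard query
hard_loops = choose_loop_count(1.0)  # multi-step reasoning query
```

The returned value would be passed as the loop count to a single deployed model, so the serving stack holds one set of weights regardless of how hard the query is.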
Isn't this the simplest version of test-time compute scaling? No complex reasoning training like o1, just a stably trained looped architecture.
Paper: Fully Looped Transformer