Test-time scaling (TTS) research has an awkward problem: everyone is "designing" strategies. When should the model think longer? When should it branch? When should it stop? Researchers hand-craft heuristics by intuition, then burn compute to validate them.
This paper from Chengsong Huang's team flips the script: instead of designing strategies, design an environment where strategies can grow on their own.
The AutoTTS framework's core idea is straightforward: frame TTS strategy discovery as a controller synthesis problem. Pre-collect the model's reasoning trajectories and probe signals, then let an agent learn within that environment when to branch, continue, probe, prune, or stop.
The discovery process doesn't require repeated LLM calls because the controller makes decisions on pre-collected data, making evaluation extremely cheap. A key design called beta parameterization makes the search tractable, and fine-grained execution trace feedback lets the agent diagnose why its TTS program failed.
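To make the "no repeated LLM calls" point concrete, here is a minimal sketch of replaying a controller over cached trajectories. It is not the paper's implementation. The data layout (`Trajectory`, `Question`), the two knobs (`max_branches`, `stop_after_stable`), and the toy data are all assumptions made up for illustration, and the controller is a plain stability-threshold rule rather than the beta parameterization, which this summary does not describe.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Trajectory:
    # probes[t] is the answer the model gave when probed after step t.
    # Hypothetical format; the real cached data may look different.
    probes: list[str]
    tokens_per_step: int


@dataclass
class Question:
    gold: str
    trajectories: list[Trajectory]


def replay(question, max_branches, stop_after_stable):
    """Replay a simple threshold controller on cached data.

    Returns (final_answer, tokens_spent). No model is called here:
    every "decision" just reads further into pre-collected probes.
    Assumes max_branches >= 1.
    """
    tokens = 0
    votes = Counter()
    for traj in question.trajectories[:max_branches]:  # branch
        last, stable = None, 0
        for answer in traj.probes:                     # continue + probe
            tokens += traj.tokens_per_step
            if answer == last:
                stable += 1
            else:
                last, stable = answer, 1
            if stable >= stop_after_stable:            # stop this branch early
                break
        votes[last] += 1
    return votes.most_common(1)[0][0], tokens


def evaluate(questions, **params):
    """Average accuracy and token cost of one controller setting."""
    correct = 0
    total_tokens = 0
    for q in questions:
        answer, tokens = replay(q, **params)
        correct += answer == q.gold
        total_tokens += tokens
    return correct / len(questions), total_tokens / len(questions)


# Made-up toy data: one question, three cached trajectories.
toy = [
    Question(
        gold="42",
        trajectories=[
            Trajectory(["7", "42", "42", "42"], tokens_per_step=100),
            Trajectory(["42", "42", "42", "42"], tokens_per_step=100),
            Trajectory(["7", "7", "7", "7"], tokens_per_step=100),
        ],
    )
]

if __name__ == "__main__":
    for branches in (1, 3):
        for stable in (1, 2):
            acc, cost = evaluate(
                toy, max_branches=branches, stop_after_stable=stable
            )
            print(branches, stable, acc, cost)
```

Because `evaluate` only reads cached lists, sweeping many controller settings costs milliseconds instead of fresh model calls, which is the property the discovery loop relies on.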
The result: on mathematical reasoning benchmarks, the auto-discovered strategies outperform strong hand-designed baselines on the accuracy-cost tradeoff, and generalize to held-out benchmarks and different model scales.
Total cost? $39.9 and 160 minutes.
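One common way to read "outperforms on the accuracy-cost tradeoff" is Pareto dominance: a strategy wins if it is at least as accurate and at least as cheap as another, and strictly better on one of the two. The sketch below assumes that reading, which the summary does not confirm. The strategy names and numbers are placeholders invented for illustration, not results from the paper.

```python
def dominates(a, b):
    """True if point a = (accuracy, cost) Pareto-dominates point b.

    Higher accuracy is better, lower cost is better.
    """
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better


def pareto_front(points):
    """Keep only the strategies that no other strategy dominates."""
    return {
        name: p
        for name, p in points.items()
        if not any(
            dominates(q, p) for other, q in points.items() if other != name
        )
    }


# Placeholder (accuracy, average tokens) pairs. NOT numbers from the paper.
points = {
    "strategy_a": (0.80, 1000),
    "strategy_b": (0.80, 1500),
    "strategy_c": (0.85, 2000),
    "strategy_d": (0.70, 2500),
}

if __name__ == "__main__":
    print(sorted(pareto_front(points)))
```

In this toy set, a strategy that is both less accurate and more expensive than another drops off the front, while a more accurate but pricier one stays on it. That is why a single accuracy number is not enough to compare TTS strategies.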
Why this matters
TTS is one of the hottest LLM optimization directions in 2026. OpenAI's o-series has already demonstrated the value of test-time compute at product scale, but strategy design remains a manual craft. If AutoTTS's approach holds up, it means TTS strategies can be searched automatically, just like training hyperparameters.
Caveats: experiments focus on mathematical reasoning, so generalization to code generation and creative writing is unproven. Beta parameterization also depends on the quality of the pre-collected data, and collecting that data is a hidden cost.
Code and data will be open-sourced. If you're working on LLM reasoning optimization, this paper offers a clear alternative path: stop hand-writing rules, start building discovery environments.
Sources:
- arXiv:2605.08083, "LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling", Tong Zheng et al., May 2026