AutoTTS: Letting LLMs Discover Their Own Optimal Reasoning Strategy for $40

Test-time scaling (TTS) research has an awkward problem: everyone is "designing" strategies. When should the model think longer? When should it branch? When should it stop? Researchers hand-craft heuristics by intuition, then burn compute to validate them.

This paper from Chengsong Huang's team flips the script: instead of designing strategies, design an environment where strategies can grow on their own.

The AutoTTS framework's core idea is straightforward: frame TTS strategy discovery as a controller synthesis problem. Pre-collect the model's reasoning trajectories and probe signals, then let an agent learn within that environment when to branch, continue, probe, prune, or stop.

The discovery process doesn't require repeated LLM calls: the controller makes its decisions on pre-collected data, so evaluating a candidate strategy is extremely cheap. A key design choice, beta parameterization, makes the search tractable, and fine-grained execution trace feedback lets the agent diagnose why its TTS program failed.
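To make the offline-replay idea concrete, here is a minimal sketch, not the paper's implementation. It assumes each cached trajectory is a list of steps with a token cost and a probe answer (what the model would answer if stopped there). The names Step, replay, max_branches, and agree_k are hypothetical. The toy controller models only continue, probe, branch, and stop; it omits prune and does not attempt to reproduce beta parameterization.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Step:
    tokens: int  # tokens spent on this cached reasoning chunk
    probe: str   # cached probe signal: the answer if the model stopped here


def replay(trajectories, gold, max_branches, agree_k):
    """Replay a toy controller on cached data. Returns (is_correct, tokens_used).

    No LLM is called: every decision reads from the pre-collected trajectories.
    """
    active = [0]      # branches still being extended
    pos = {0: 0}      # next cached step index for each opened branch
    last_probe = {}   # most recent probe answer per branch
    tokens = 0
    while True:
        # CONTINUE + PROBE: advance each active branch by one cached step.
        for b in list(active):
            if pos[b] < len(trajectories[b]):
                step = trajectories[b][pos[b]]
                tokens += step.tokens
                last_probe[b] = step.probe
                pos[b] += 1
            else:
                active.remove(b)  # cached trajectory exhausted
        votes = Counter(last_probe.values())
        if not votes:
            return False, tokens
        answer, count = votes.most_common(1)[0]
        can_branch = len(pos) < min(max_branches, len(trajectories))
        # STOP: enough branches agree, or nothing is left to explore.
        if count >= agree_k or (not active and not can_branch):
            return answer == gold, tokens
        # BRANCH: open one more cached trajectory.
        if can_branch:
            new_branch = len(pos)
            pos[new_branch] = 0
            active.append(new_branch)


# Made-up cache for a single question whose correct answer is "42".
cache = [
    [Step(100, "41"), Step(100, "42"), Step(100, "42")],
    [Step(100, "42"), Step(100, "42")],
    [Step(100, "7"), Step(100, "42")],
]
print(replay(cache, "42", max_branches=3, agree_k=2))  # (True, 300)
print(replay(cache, "42", max_branches=1, agree_k=1))  # (False, 100)
```

Because both calls read from the same cache, comparing two controller settings costs only a few dictionary lookups, which is the property that makes large-scale strategy search affordable.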

The result: on mathematical reasoning benchmarks, the auto-discovered strategies outperform strong hand-designed baselines on the accuracy-cost tradeoff, and generalize to held-out benchmarks and different model scales.
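One common way to read an "accuracy-cost tradeoff" claim is Pareto dominance: a strategy wins if no other strategy is at least as cheap and at least as accurate, and strictly better on one of the two. The sketch below illustrates that reading only. The function name pareto_front, the strategy labels, and all numbers are made up for illustration and are not results from the paper.

```python
def pareto_front(points):
    """points: list of (name, cost, accuracy). Returns names no other point dominates."""
    front = []
    for name, cost, acc in points:
        dominated = any(
            (c <= cost and a >= acc) and (c < cost or a > acc)
            for _, c, a in points
        )
        if not dominated:
            front.append(name)
    return front


# Illustrative, invented numbers: (label, average tokens, accuracy).
strategies = [
    ("fixed_budget", 1000, 0.70),
    ("self_consistency_8", 8000, 0.78),
    ("candidate_a", 3000, 0.78),
    ("candidate_b", 900, 0.66),
]
print(pareto_front(strategies))  # ['fixed_budget', 'candidate_a', 'candidate_b']
```

Here self_consistency_8 drops out because candidate_a reaches the same accuracy with fewer tokens.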

Total cost? $39.9 and 160 minutes.

Why this matters

TTS is one of the hottest LLM optimization directions in 2026. OpenAI's o-series has already demonstrated the value of test-time compute at product scale, but strategy design remains a manual craft. If AutoTTS's approach holds up, it means TTS strategies can be searched automatically, just like training hyperparameters.

Caveats: the experiments focus on mathematical reasoning, so generalization to code generation and creative writing is unproven. Beta parameterization depends on the quality of the pre-collected data, which is a hidden cost.

Code and data will be open-sourced. If you're working on LLM reasoning optimization, this paper offers a clear alternative path: stop hand-writing rules, start building discovery environments.

Sources:

arXiv:2605.08083, "LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling", Tong Zheng et al., May 2026
