Alibaba Tongyi Lab's ToolCUA: Teaching Computer Use Agents "When to Call an API vs. When to Click the Mouse"

Computer Use Agents (CUAs) have drawn a great deal of attention over the past couple of years. Anthropic pioneered the approach by adding computer use capabilities to Claude, letting the model operate a computer the way a human does: moving the mouse, clicking buttons, and typing on the keyboard. However, researchers quickly ran into an awkward problem: CUAs are too "by-the-book."

They can click and type, but when faced with a task like "convert this 50-page PDF to Word and send it to the client," they will dutifully copy and paste page by page, rather than simply calling a conversion API to get it done in one step.

This is the core dilemma of the hybrid action space: an Agent can perform atomic-level GUI operations (clicking, typing) or call high-level tools (APIs, scripts), but it doesn't know which path to choose and when.
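To make the hybrid action space concrete, here is a minimal Python sketch of one way to represent it. The class names, fields, and the `convert_pdf_to_docx` tool are hypothetical illustrations, not taken from the paper.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Union


@dataclass
class GuiAction:
    """One atomic GUI operation, such as a click or a keystroke."""
    kind: str          # e.g. "click", "type", "scroll"
    target: str = ""   # hypothetical description of the UI element
    text: str = ""     # text to type, if any


@dataclass
class ToolCall:
    """One high-level tool invocation, such as an API or a script."""
    name: str
    arguments: dict = field(default_factory=dict)


# A trajectory in the hybrid action space may mix both kinds of action.
Action = Union[GuiAction, ToolCall]


def count_by_type(trajectory: list[Action]) -> dict[str, int]:
    """Count how many GUI actions and tool calls a trajectory contains."""
    counts = {"gui": 0, "tool": 0}
    for action in trajectory:
        key = "tool" if isinstance(action, ToolCall) else "gui"
        counts[key] += 1
    return counts


# Two ways to start the same task: many atomic steps, or one tool call.
gui_path = [
    GuiAction("click", target="PDF page 1"),
    GuiAction("type", text="<copied text>"),
    GuiAction("click", target="PDF page 2"),
]
tool_path = [
    ToolCall("convert_pdf_to_docx", {"path": "report.pdf"}),  # hypothetical tool
    GuiAction("click", target="Send button"),
]
```

The agent's problem is not executing either kind of action, but deciding at each step which branch of `Action` to emit.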

The ToolCUA paper from Alibaba Tongyi Lab targets exactly this problem.

Bridging the Data Gap with Synthesis

What's the biggest obstacle to training a CUA? Data.

High-quality trajectories that alternate between GUI operations and tool calls are extremely scarce. You can collect pure GUI operation data (screen recordings, automation scripts), but collecting trajectories involving real tool calls is both expensive and fragile—every tool has its own interface, error patterns, and authentication workflows.

ToolCUA's approach is clever: repurpose a large number of static GUI trajectories to synthesize a grounded tool library.

Specifically, they designed an "Interleaved GUI-Tool Trajectory Expansion Pipeline." They start with existing GUI trajectories (e.g., from datasets like OpenWebVoyager), automatically identify action segments that can be replaced by tools, and swap them out for the corresponding tool calls. This way, a massive amount of interleaved GUI-tool trajectories can be generated without manual annotation or running real tool executions.

This isn't just simple data augmentation, but rather structured trajectory reconstruction—it preserves the original task semantics while altering the execution path.
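The replacement step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: the `Step` and `Segment` types are hypothetical, and the replaceable segments are assumed to have been identified already and to be non-overlapping.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Step:
    kind: str     # "gui" or "tool"
    action: str   # human-readable description of the step


@dataclass
class Segment:
    start: int       # index of the first GUI step to replace (inclusive)
    end: int         # index of the last GUI step to replace (inclusive)
    tool_call: str   # hypothetical tool call that achieves the same effect


def expand_trajectory(steps: list[Step], segments: list[Segment]) -> list[Step]:
    """Swap each marked GUI segment for a single tool step.

    Steps outside any segment are copied unchanged, so the task semantics
    are preserved while the execution path gets shorter.
    """
    by_start = {seg.start: seg for seg in segments}
    out: list[Step] = []
    i = 0
    while i < len(steps):
        seg = by_start.get(i)
        if seg is not None:
            out.append(Step("tool", seg.tool_call))
            i = seg.end + 1   # skip the GUI steps the tool replaces
        else:
            out.append(steps[i])
            i += 1
    return out


gui_only = [
    Step("gui", "open spreadsheet"),
    Step("gui", "select column A"),
    Step("gui", "open sort dialog"),
    Step("gui", "confirm ascending sort"),
    Step("gui", "save file"),
]
interleaved = expand_trajectory(
    gui_only, [Segment(start=1, end=3, tool_call="sort_column(col='A')")]
)
```

Here a five-step GUI trajectory becomes a three-step interleaved one, with the sort handled by one tool call between two ordinary GUI steps.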

Three-Stage Training: From Imitation to Autonomous Exploration

ToolCUA's training follows a three-step process, with each stage being more challenging and closer to real-world scenarios than the last.

Stage 1: SFT Warm-up. Supervised Fine-Tuning (SFT) is performed on the synthesized trajectories, giving the model a preliminary understanding of "what to use and when." This step lays the foundation, essentially giving the Agent basic intuition.

Stage 2: Tool-Bootstrapped GUI RFT (Reinforcement Fine-Tuning). This is the crucial step. The model undergoes single-turn RL optimization at critical GUI-tool switching points—instead of attempting credit assignment across entire long trajectories (which is notoriously difficult), it focuses locally on the "switching decision." Simply put, it trains the model to make more accurate choices at the fork between "should I keep clicking or should I call an API?"
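One way to picture the switching points is as the steps where the action type changes. The sketch below rests on that assumption (the paper's own criterion for a "critical" point may differ) and turns each such step into a single-turn sample; the function names are hypothetical.

```python
from __future__ import annotations


def switch_points(kinds: list[str]) -> list[int]:
    """Return indices where the action type differs from the previous step."""
    return [i for i in range(1, len(kinds)) if kinds[i] != kinds[i - 1]]


def single_turn_samples(kinds: list[str]) -> list[dict]:
    """Build one local decision sample per switch point.

    Each sample pairs the history up to the switch with the action type
    chosen there, so optimization can target that one decision rather
    than the whole trajectory.
    """
    return [
        {"context": kinds[:i], "target": kinds[i]}
        for i in switch_points(kinds)
    ]


trajectory_kinds = ["gui", "gui", "tool", "gui"]
samples = single_turn_samples(trajectory_kinds)
```

Restricting training to these samples sidesteps long-horizon credit assignment: each sample carries exactly one decision to score.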

Stage 3: Online Agentic RL. The model is placed into a high-fidelity GUI-tool sandbox environment for autonomous exploration. The reward function here is a carefully designed "Tool-Efficient Path Reward"—it evaluates not just whether the task is completed, but also how efficient the path was. Points are awarded for using tools to skip steps, and deducted for using them without gaining efficiency.

The brilliance of this design lies in the fact that it doesn't directly reward the act of "using a tool," but rather rewards "efficiency gains brought by using a tool." This prevents the model from becoming overly reliant on tools or avoiding them altogether.
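A reward with this shape could look something like the sketch below. The formula, the `bonus_weight` and `penalty` values, and the use of a GUI-only step count as the reference are all assumptions made for illustration; the paper's actual Tool-Efficient Path Reward is not reproduced here.

```python
def tool_efficient_path_reward(
    success: bool,
    steps_taken: int,
    gui_only_steps: int,
    tool_calls: int,
    bonus_weight: float = 0.5,   # assumed weight on the efficiency bonus
    penalty: float = 0.1,        # assumed cost of an unhelpful tool call
) -> float:
    """Score a rollout by task success plus the efficiency tools bought.

    gui_only_steps is a hypothetical reference length: how many steps the
    task would take using GUI actions alone.
    """
    if not success:
        return 0.0               # no credit for an unfinished task
    reward = 1.0
    if tool_calls > 0 and gui_only_steps > 0:
        saved = gui_only_steps - steps_taken
        if saved > 0:
            # Tools shortened the path: bonus scales with the fraction saved.
            reward += bonus_weight * saved / gui_only_steps
        else:
            # Tools were used but bought no efficiency: small deduction.
            reward -= penalty
    return reward
```

Because the bonus depends on steps saved rather than on the number of tool calls, calling a tool is never rewarded for its own sake, and a pure GUI rollout that succeeds still earns the base reward.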

Results: A 66% Relative Improvement

On the OSWorld-MCP benchmark, ToolCUA achieved an accuracy of 46.85%, representing a ~66% relative improvement over the baseline model. Even more notably, it showed a 3.9% improvement over a pure GUI setup, demonstrating that GUI-tool orchestration isn't just about "having an extra option" but actually delivers complementary gains.

Among models of comparable scale, this sets a new state of the art (SOTA).

Why This Path is the Right One

The significance of ToolCUA doesn't lie in leaderboard chasing, but in its validation of a paradigm: training in hybrid action spaces is feasible, and it comes with a clear methodology.

Previously, CUA researchers either focused on GUI operations (visual grounding, click coordinate prediction) or on tool calling (API selection, parameter filling). ToolCUA bridges the two, and not by simply stitching them together, but by using phased training to enable the model to truly master the higher-order capability of "path selection."

The project is already open source, with code available on GitHub and model weights hosted on Hugging Face (mPLUG/ToolCUA-8B). Achieving this performance with just 8B parameters makes it highly cost-effective.

For anyone building Agents, ToolCUA's training paradigm—Synthesized Trajectory Expansion → Local Switch-Point RL → Global Autonomous Exploration—is well worth serious consideration.

Paper: arXiv:2605.12481
Project Page: https://x-plug.github.io/ToolCUA/
Code: https://github.com/X-PLUG/ToolCUA
Model: https://huggingface.co/mPLUG/ToolCUA-8B
