OpenComputer: Building a Verifiable Software World for Computer-Use Agents, 33 Apps, 1000 Tasks

Today's top spot on Hugging Face Daily Papers goes to OpenComputer, a paper from Arman Cohan's team at Yale University. The title sounds academic, but it addresses a very practical pain point: how do we know whether an AI agent is actually doing the right thing on a computer?

Computer-use agents, which let an AI control the mouse and keyboard to operate desktop software, are a major trend for 2025–2026. Evaluating them, however, remains a persistent challenge: high scores on benchmarks like OSWorld-Verified do not guarantee that an agent can reliably complete end-to-end tasks in real-world scenarios.

Four Core Components

OpenComputer's architecture is built from four components, each targeting a weakness in existing solutions:

1. Application-Level State Verifiers

This is the most intriguing part of the paper. The team developed hard-coded state verifiers for 33 desktop applications (browsers, Office suites, creative software, development environments, file managers, and communication tools), which check the actual state of the applications via structured inspection endpoints.

Instead of having an LLM look at screenshots and guess the outcome, a verifier directly checks whether "the file was saved," "the email was sent," or "the code compiled successfully."
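To make the idea concrete, here is a minimal sketch of a deterministic state verifier. It is not the paper's implementation: the `StateCheck` class, the `verify` function, and the `files` and `dirty` fields are all hypothetical, and the dictionary stands in for whatever a real inspection endpoint would return.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

# Hypothetical shape of what an inspection endpoint might return for a
# text editor; the field names are illustrative, not taken from the paper.
AppState = Dict[str, Any]


@dataclass
class StateCheck:
    """One named, deterministic predicate over structured application state."""
    name: str
    predicate: Callable[[AppState], bool]


def verify(state: AppState, checks: list[StateCheck]) -> Dict[str, bool]:
    """Run every check against the state and report pass/fail per check."""
    return {check.name: bool(check.predicate(state)) for check in checks}


# Illustrative task: "save the open document as report.txt".
checks = [
    StateCheck("file_exists", lambda s: "report.txt" in s.get("files", [])),
    StateCheck("no_unsaved_changes", lambda s: s.get("dirty") is False),
]

# Assumed state after the agent finishes; a real harness would query the app.
state_after_run = {"files": ["notes.md", "report.txt"], "dirty": False}
results = verify(state_after_run, checks)
print(results)
```

Because each check is a plain predicate over structured state, the same input always produces the same verdict, which is the property that separates this approach from asking a model to judge a screenshot.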

2. Self-Evolving Verification Layer

The verifiers themselves also need to evolve. OpenComputer introduces a self-improvement layer that enhances verifier reliability through feedback during execution. Simply put: the verifiers learn from their own mistakes.

3. Task Generation Pipeline

The pipeline automatically synthesizes realistic, machine-verifiable desktop tasks. The 1,000 tasks cover a wide range of scenarios, from simple actions like "open a file" to complex, multi-step workflows.

4. Evaluation Harness

The harness records complete operation trajectories and calculates auditable partial-credit rewards. This is far more granular than a simple binary "success/failure" judgment.
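One plausible way to turn per-check results into a partial-credit reward is a weighted pass fraction. The sketch below is an assumption for illustration only: the article does not state the paper's scoring formula, and `CheckResult`, `partial_credit`, the check names, and the weights are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    """Outcome of one verifier check, kept so the reward can be audited."""
    name: str
    weight: float
    passed: bool


def partial_credit(results: list[CheckResult]) -> float:
    """Return the weighted fraction of checks that passed, in [0, 1]."""
    total = sum(r.weight for r in results)
    if total <= 0:
        raise ValueError("checks must carry positive total weight")
    earned = sum(r.weight for r in results if r.passed)
    return earned / total


# Illustrative multi-step task: draft an email, attach a file, send it.
results = [
    CheckResult("draft_created", 1.0, True),
    CheckResult("attachment_added", 1.0, True),
    CheckResult("email_sent", 2.0, False),
]
reward = partial_credit(results)
print(reward)
```

A binary judgment would score this run as a failure. The weighted score instead records that the agent completed the early steps and missed the final one, and the list of `CheckResult` entries shows exactly which check cost it the credit.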

Key Findings

The paper presents several surprising conclusions:

OpenComputer's hard-coded verifiers align significantly better with human judgment than LLM-as-judge approaches, especially when success depends on fine-grained application states.
Frontier agents still struggle with end-to-end task completion, despite being capable of handling individual steps.
There is a noticeable gap between open-source models' scores on OSWorld-Verified and their actual performance, exposing a persistent divide in the computer automation field.

Why It Matters

The value of this paper lies not only in proposing a new framework, but also in directly confronting a fundamental question in agent evaluation: what exactly are we measuring?

As LLM-as-judge becomes the default evaluation method, OpenComputer demonstrates experimentally that, for tasks involving specific application states, hard-coded verifiers are more reliable than LLM judgments. This has significant implications for the entire agent research community.

Furthermore, with its coverage of 1,000 tasks and 33 applications, it stands as one of the most comprehensive evaluation frameworks for computer-use agents to date.

Paper link: arXiv:2605.19769
