Fudan × PKU Propose AHE: Letting the Harness Evolve Itself, Beating Codex-CLI in 10 Rounds

The days of human engineers tweaking Harnesses might be numbered.

When we talk about Harness Engineering, we usually assume a premise: humans design the Harness, and the Agent executes within it. We write rules, set constraints, add feedback loops, and watch the Agent work inside this cage.

But the Agentic Harness Engineering (AHE) framework, newly proposed by Fudan University, Peking University, and Shanghai Qiji Zhifeng, flips this premise: it lets the Agent read its own traces, find problems, modify its Harness, and check in the next round of evaluation whether the changes actually work.

From “Humans Tweaking Harness” to “Agents Tweaking Harness”

The core logic of this paper is intuitive: since the Agent is already executing tasks, it knows best where it gets stuck and where it fails. Instead of humans staring at millions of tokens of execution traces to manually patch things, why not let the Agent do it itself?

AHE’s workflow is a closed loop:

Observability: The Agent reads its complete execution trace.
Diagnosis: It analyzes what went wrong. Was it a wrong tool call? Were the constraints too tight or too loose?
Modification: It automatically modifies the Harness configuration, prompts, or workflow.
Validation: It checks on Terminal-Bench 2 whether the modified Harness actually improves pass@1.
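The four steps above can be sketched as a small loop. This is only an illustrative sketch, not the paper's implementation: the `Harness` and `Trace` classes, the `evolve` function, and the toy benchmark, diagnosis, and modification functions are all hypothetical names invented here. It also assumes that a change is kept only when it raises pass@1, which the article does not confirm.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Harness:
    # Hypothetical harness: a real one would also hold tools, workflow, etc.
    system_prompt: str
    max_retries: int


@dataclass(frozen=True)
class Trace:
    # Hypothetical execution trace for one benchmark task.
    task_id: str
    passed: bool
    log: str


def pass_at_1(traces: List[Trace]) -> float:
    """Fraction of tasks solved on the single attempt recorded per task."""
    return sum(t.passed for t in traces) / len(traces)


def evolve(
    harness: Harness,
    run_benchmark: Callable[[Harness], List[Trace]],
    diagnose: Callable[[List[Trace]], str],
    modify: Callable[[Harness, str], Harness],
    rounds: int,
) -> Tuple[Harness, float]:
    traces = run_benchmark(harness)             # Observability
    score = pass_at_1(traces)
    for _ in range(rounds):
        finding = diagnose(traces)              # Diagnosis
        candidate = modify(harness, finding)    # Modification
        cand_traces = run_benchmark(candidate)  # Validation
        cand_score = pass_at_1(cand_traces)
        # Assumption: keep a change only if it improves the score.
        if cand_score > score:
            harness, traces, score = candidate, cand_traces, cand_score
    return harness, score


# Toy stand-ins so the loop can run without a real agent or benchmark.
def toy_benchmark(h: Harness) -> List[Trace]:
    # Toy rule: task i passes when the retry budget is at least i.
    return [
        Trace(f"task-{i}", h.max_retries >= i, f"retries={h.max_retries}")
        for i in range(1, 5)
    ]


def toy_diagnose(traces: List[Trace]) -> str:
    failed = any(not t.passed for t in traces)
    return "retry budget too small" if failed else "no failures"


def toy_modify(h: Harness, finding: str) -> Harness:
    if finding == "retry budget too small":
        return replace(h, max_retries=h.max_retries + 1)
    return h


final, score = evolve(
    Harness("You are a coding agent.", 1),
    toy_benchmark, toy_diagnose, toy_modify, rounds=10,
)
print(final.max_retries, score)
```

In a real system the diagnosis and modification steps would be carried out by the Agent itself, reading its traces and editing its own configuration; the toy functions here only make the control flow visible.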

Results: Surpassing Codex-CLI in 10 Rounds

The experimental data is straightforward:

Starting point: Agent with initial Harness, Terminal-Bench 2 pass@1 at 69.7%
After 10 rounds of automated evolution: pass@1 improved to 77.0%
Comparison: Surpassed the human-designed Codex-CLI Harness
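To put the two reported scores in perspective, here is a quick calculation that uses only the 69.7% and 77.0% figures above. The `summarize` helper is a hypothetical name introduced for this illustration, and the derived numbers are simple arithmetic, not results reported by the paper.

```python
def summarize(baseline: float, evolved: float) -> dict:
    """Compare two pass@1 scores given in percent."""
    gain = evolved - baseline
    return {
        # Absolute gain in percentage points.
        "absolute_points": round(gain, 1),
        # Gain relative to the baseline pass rate.
        "relative_percent": round(gain / baseline * 100, 1),
        # Share of previously failing tasks that now pass.
        "failures_removed_percent": round(gain / (100 - baseline) * 100, 1),
    }


result = summarize(69.7, 77.0)
print(result)
# {'absolute_points': 7.3, 'relative_percent': 10.5, 'failures_removed_percent': 24.1}
```

In other words, the gain is 7.3 percentage points, which corresponds to roughly a quarter of the originally failing tasks being solved after evolution.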

What does this mean? It means Harness Engineering itself is evolving from a “craft” into an “automatable process.” Harness optimizations that might take human engineers weeks to work out, AHE reaches in ten automated rounds.

Industry Significance

This paper adds a crucial footnote to the Harness Engineering boom of early 2026:

Harness is no longer static: Previously, we thought of the Harness as relatively fixed infrastructure that humans adjust by hand whenever the model changes. AHE shows that a Harness can adapt to a task distribution on its own and even keep evolving.
Agent gets stronger without model changes: AHE’s improvements come entirely from automated evolution at the Harness layer, with the model itself unchanged. This reinforces the 2026 consensus that the Harness is the core variable determining Agent capability.
Another leap in engineering efficiency: When Harness can modify itself, developers only need to define evaluation criteria and an initial framework, leaving the rest to iterative loops. This has huge value for rapidly adapting to new models and toolchains.

Paper Information

Title: Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent
Institutions: Fudan University, Peking University, Shanghai Qiji Zhifeng
Benchmark: Terminal-Bench 2

This paper might be a significant turning point for Harness Engineering, moving from “manual crafting” to “automated evolution.” For teams building Agent systems, its open-source implementation and future progress are worth watching.
