ACC: Compiling Agent Trajectories into Long-Context QA for Direct Reasoning

Agents solve problems by calling tools repeatedly: the user asks a question, the model invokes a tool, reads the result, calls another tool, and so on. After a dozen or so interactions, it finally has enough information to answer.

But standard SFT has a blind spot: it trains only turn-level tool selection and masks tool responses out of the gradient updates. All the evidence accumulated across those turns is effectively discarded during training.
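The masking described above can be illustrated with a minimal sketch. It assumes a hypothetical `Message` type and uses whitespace splitting as a stand-in for a real tokenizer; it is not the paper's training code, only a picture of which tokens a typical turn-level SFT setup supervises.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Message:
    role: str      # "user", "assistant", or "tool" (hypothetical role labels)
    content: str


def loss_mask(messages: List[Message]) -> Tuple[List[str], List[int]]:
    """Return (tokens, mask); mask is 1 where a token contributes to the loss."""
    tokens: List[str] = []
    mask: List[int] = []
    for m in messages:
        words = m.content.split()  # assumption: whitespace split replaces a real tokenizer
        tokens.extend(words)
        # Only assistant turns are supervised; user and tool text is masked out.
        mask.extend([1 if m.role == "assistant" else 0] * len(words))
    return tokens, mask


trajectory = [
    Message("user", "Which city hosted the 2012 Olympics?"),
    Message("assistant", 'search(query="2012 Olympics host")'),
    Message("tool", "The 2012 Summer Olympics were held in London, United Kingdom."),
    Message("assistant", "London"),
]

tokens, mask = loss_mask(trajectory)
print(f"{sum(mask)} of {len(tokens)} tokens receive gradient")
```

In this toy trajectory the tool response is the longest message, yet none of its tokens are supervised, which is the blind spot the post describes.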

ACC (Agent Context Compilation, arXiv:2605.21850, Qisheng Su et al., May 21, 2026) proposes a simple idea: compile these trajectories into long-context QA pairs.

The compilation process

In a raw trajectory, the question and answer are separated by a dozen tool calls and environment observations. ACC concatenates the original question + all tool responses + environment observations into a single long-context QA pair, training the model to find the answer directly without tool access.
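A minimal sketch of this compilation step follows. The `Step` and `Trajectory` types, the `kind` labels, and the prompt layout are all assumptions made for illustration; the post does not specify the exact format ACC uses.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Step:
    kind: str   # "tool_call", "tool_response", or "observation" (hypothetical labels)
    text: str


@dataclass
class Trajectory:
    question: str
    steps: List[Step]
    final_answer: str


def compile_to_qa(traj: Trajectory) -> Dict[str, str]:
    """Flatten one trajectory into a single long-context QA example."""
    # Keep only the evidence the agent gathered; the tool calls themselves are dropped.
    evidence = [s.text for s in traj.steps if s.kind in ("tool_response", "observation")]
    context = "\n\n".join(f"[Evidence {i}]\n{text}" for i, text in enumerate(evidence, 1))
    # Assumed layout: question first, then all evidence, then an answer slot.
    prompt = f"Question: {traj.question}\n\n{context}\n\nAnswer:"
    return {"prompt": prompt, "answer": traj.final_answer}


traj = Trajectory(
    question="Which city hosted the 2012 Summer Olympics?",
    steps=[
        Step("tool_call", 'search(query="2012 Olympics host")'),
        Step("tool_response", "The 2012 Summer Olympics were held in London."),
        Step("observation", "Page loaded: olympics.example/2012"),
    ],
    final_answer="London",
)

example = compile_to_qa(traj)
print(example["prompt"])
```

The resulting pair can be trained on like any other QA sample: the model sees the full evidence in one context window and is supervised only on the final answer, with no tool access involved.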

The key insight is that this data is essentially free: trajectories that agents produce anyway during normal operation become high-quality long-context training samples once ACC compiles them. Search agents, software engineering agents, database querying agents: any tool-calling agent can serve as a source.

Results

On Qwen3-30B-A3B, after ACC training:

MRCR (cross-turn coreference resolution): 68.3, +18.1 over baseline
GraphWalks (graph traversal): 77.5, +7.6 over baseline

These results approach Qwen3-235B-A22B levels. Through a data compilation method alone, a 30B model comes close to a model roughly 7x larger on long-context reasoning.
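The reported scores and gains imply the baseline numbers, which the post does not state directly. The short calculation below derives them, assuming each gain is an absolute point difference against the same model before ACC training.

```python
# Scores and gains as reported in the post (Qwen3-30B-A3B after ACC training).
reported = {
    "MRCR": {"score": 68.3, "gain": 18.1},
    "GraphWalks": {"score": 77.5, "gain": 7.6},
}

# Implied baseline = score after ACC minus the reported gain (assumed absolute points).
baselines = {name: round(r["score"] - r["gain"], 1) for name, r in reported.items()}

for name, base in baselines.items():
    relative = 100 * reported[name]["gain"] / base
    print(f"{name}: implied baseline {base}, relative gain {relative:.0f}%")
```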

Scores on the general-capability benchmarks GPQA, MMLU-Pro, AIME, and IFEval remain unchanged, so ACC does not appear to harm general capabilities.

Why it matters

Long-context training has two established routes: continue pretraining on long documents (costly) or synthesize context with heuristics (unstable quality). ACC provides a third path: use naturally accumulated multi-turn interaction data from agent runs — rich, real, and free.

Sources:

arXiv:2605.21850, ACC: Compiling Agent Trajectories for Long-Context Training, Qisheng Su et al., 2026-05-21
