AEM: Solving Credit Assignment in Multi-Turn Agent RL Without Extra Supervision

Multi-turn agent RL has a long-standing problem: the environment provides only an outcome reward at task completion, with no feedback on individual steps. Credit assignment, meaning how to distribute the final outcome across steps, is the bottleneck.

Common solutions introduce dense intermediate supervision, such as process reward models or auxiliary self-supervised signals. But this adds supervision complexity and tuning cost, and may limit cross-task generalization.

This paper takes a different path: no extra supervision, just adaptive entropy modulation for credit assignment.

How AEM Works

The authors lift entropy dynamics from the token level to the response level. The reasoning: in multi-turn agent RL, the environment is affected by a complete response, not individual tokens. Aligning uncertainty estimation at the response level reduces sensitivity to token-level sampling noise.
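
To make the response-level view concrete, here is a minimal sketch. It is not the paper's implementation. It assumes you have per-token log-probabilities for several responses sampled from the same state, and the function names are hypothetical. A response's surprisal is the negative sum of its token log-probs, and the mean surprisal over the sampled group serves as a Monte Carlo estimate of response-level entropy.

```python
import math
from typing import List


def response_surprisal(token_logprobs: List[float]) -> float:
    """Surprisal of a whole response: -log pi(response).

    For an autoregressive policy, log pi(response) is the sum of the
    per-token log-probabilities, so surprisal is the negated sum.
    """
    return -sum(token_logprobs)


def response_entropy_estimate(group: List[List[float]]) -> float:
    """Monte Carlo estimate of response-level entropy.

    Entropy is the expected surprisal, so we average the surprisal of
    responses sampled from the same state (hypothetical input layout:
    one list of token log-probs per sampled response).
    """
    if not group:
        raise ValueError("need at least one sampled response")
    return sum(response_surprisal(r) for r in group) / len(group)


def relative_surprisal(group: List[List[float]]) -> List[float]:
    """Surprisal of each response minus the group's entropy estimate."""
    h = response_entropy_estimate(group)
    return [response_surprisal(r) - h for r in group]


if __name__ == "__main__":
    # Two made-up responses: one short and likely, one longer and rarer.
    sampled = [
        [math.log(0.5)],
        [math.log(0.25), math.log(0.5)],
    ]
    print(relative_surprisal(sampled))
```

A positive relative surprisal marks a response the policy found less likely than its average sample, and a negative value marks one it found more likely.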

Further theoretical analysis reveals that entropy drift under natural-gradient updates is governed by the interaction between the sampled response's advantage and its relative surprisal. Based on this, AEM derives a practical response-level uncertainty proxy and uses it to rescale advantages, naturally transitioning from exploration to exploitation as the balance of positive and negative samples evolves.
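
The interaction between advantage and relative surprisal can be checked numerically in a toy setting. The sketch below is an illustration under stated assumptions, not the paper's derivation or its rescaling rule. It assumes a tabular softmax policy over three candidate responses, for which a natural-gradient step moves each logit by the learning rate times that response's advantage. It then compares the actual entropy change with the first-order prediction lr * sum(p * A * (surprisal - H)). All names and numbers are made up.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a small list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def predicted_entropy_drift(logits: List[float],
                            advantages: List[float],
                            lr: float) -> float:
    """First-order entropy change: lr * E[A * (surprisal - H)]."""
    probs = softmax(logits)
    h = entropy(probs)
    return lr * sum(
        p * a * (-math.log(p) - h) for p, a in zip(probs, advantages)
    )


def actual_entropy_drift(logits: List[float],
                         advantages: List[float],
                         lr: float) -> float:
    """Entropy change after the assumed update: logit += lr * advantage."""
    new_logits = [z + lr * a for z, a in zip(logits, advantages)]
    return entropy(softmax(new_logits)) - entropy(softmax(logits))


if __name__ == "__main__":
    logits = [2.0, 0.0, -1.0]  # response 0 is likely, response 2 is rare
    lr = 1e-3
    cases = {
        "reward the likely response": [1.0, -0.5, -0.5],
        "reward the rare response": [-0.5, -0.5, 1.0],
    }
    for name, adv in cases.items():
        pred = predicted_entropy_drift(logits, adv, lr)
        real = actual_entropy_drift(logits, adv, lr)
        print(f"{name}: predicted={pred:.3e}, actual={real:.3e}")
```

In this toy setup, rewarding the already-likely response pushes entropy down, while rewarding the rare one pushes it up. That sign pattern is the kind of signal a response-level uncertainty proxy can use when rescaling advantages.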

Results

The authors tested AEM on ALFWorld, WebShop, and SWE-bench-Verified with models from 1.5B to 32B parameters. AEM consistently improved strong RL baselines, including a +1.4% gain when integrated into a state-of-the-art software engineering RL training framework.

My Take

Improving credit assignment without process reward models is the right direction. Process reward models require extra training and tuning, which are real costs in production. AEM only modifies advantage scaling, a small change with stable effects.

That said, +1.4% on SWE-bench-Verified isn't massive. Teams already running strong RL baselines should weigh that gain carefully against the cost of adding new training logic. For teams starting agent RL from scratch, AEM provides a starting point without extra supervision components.

Sources:

arXiv:2605.00425, "AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning", Haotian Zhao et al., May 2026