ChaoBro

SDAR: Solving GRPO’s Stability Issues by Integrating Self-Distillation with Agent Reinforcement Learning

A Key Pain Point in Agent Reinforcement Learning

Reinforcement learning has proven effective for post-training LLM agents—methods like GRPO have enabled models to make better decisions in tool use, web navigation, and question answering.

Yet GRPO suffers from a fundamental limitation: it provides reward signals only at the trajectory level. In a multi-turn interactive task, final success or failure is fed back as a single scalar, so every token-level decision along the way shares the same coarse supervisory signal, and no individual step is credited or blamed on its own.

It’s like a coach telling you only “You won” or “You lost” after the match—without indicating which round, or which move, went wrong.
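To make the coarseness concrete, here is a minimal sketch of group-relative advantages in the style of GRPO. The function names, the toy rewards, and the choice of population standard deviation are illustrative assumptions, not taken from the SDAR paper or any particular codebase.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """One scalar advantage per trajectory, normalized within the sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]

def broadcast_to_tokens(advantages, token_counts):
    """Every token in a trajectory inherits that trajectory's single advantage."""
    return [[a] * n for a, n in zip(advantages, token_counts)]

# Hypothetical group of four rollouts for the same task: 1.0 = success, 0.0 = failure.
rewards = [1.0, 0.0, 0.0, 1.0]
token_counts = [5, 3, 4, 2]  # number of generated tokens in each rollout

adv = group_relative_advantages(rewards)
per_token = broadcast_to_tokens(adv, token_counts)
```

In this sketch all five tokens of the first rollout receive the same positive advantage, whether or not each of them actually helped.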

The Allure and Pitfalls of Self-Distillation

On-Policy Self-Distillation (OPSD) offers a complementary approach: a privileged-context teacher branch generates dense, token-level guidance signals. Ideally, it delivers fine-grained feedback for each decision step.
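One way to picture a dense, token-level signal is as a per-token gap between the teacher's and the student's log-probability of the token the student actually produced. The sketch below assumes that definition; the function name, the sign convention, and the numbers are hypothetical and may differ from the paper's exact formulation.

```python
def token_gaps(teacher_logprobs, student_logprobs):
    """Per-token gap: teacher log-prob minus student log-prob of the sampled token.

    Assumption: a positive gap means the privileged-context teacher finds the
    token more likely than the student does; a negative gap means less likely.
    """
    if len(teacher_logprobs) != len(student_logprobs):
        raise ValueError("teacher and student must score the same tokens")
    return [t - s for t, s in zip(teacher_logprobs, student_logprobs)]

# Hypothetical log-probs for three tokens sampled by the student.
teacher_lp = [-0.2, -1.5, -3.0]
student_lp = [-0.9, -1.5, -1.0]

gaps = token_gaps(teacher_lp, student_lp)
```

Unlike a single trajectory-level reward, this yields one number per token, which is what makes the signal dense.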

However, directly applying OPSD to multi-turn agent scenarios introduces two critical issues:

First, instability accumulates across turns. In multi-turn interaction, errors compound at each step, rendering the teacher’s supervision signal itself unstable.

Second, the teacher can be wrong. When the teacher rejects an action, it’s unclear whether the action is truly suboptimal—or whether the rejection stems from the teacher’s own flawed skill retrieval.

SDAR’s Core Design: A Gated Auxiliary Objective

SDAR adopts an elegant strategy: rather than treating OPSD as the primary optimization target, it treats it as a gated auxiliary objective. RL remains the core optimization backbone, while OPSD contributes only supplementary token-level signals.

How exactly? SDAR maps the teacher’s token-level signals through a sigmoid gating function:

  • Tokens with “positive gap” (approved by the teacher): distillation signals are strengthened
  • Tokens with “negative gap” (rejected by the teacher): distillation signals are softly attenuated—not bluntly treated as negative examples

The elegance lies in acknowledging the teacher’s imperfection: a rejection may be justified—or it may reflect a misjudgment. Hence, SDAR avoids hard rejection and opts instead for soft attenuation.
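The gating idea can be sketched as follows, assuming the gate is a plain sigmoid of the token-level gap and the gated term is added to the RL loss with a small coefficient. The names `beta` and `lam`, the averaging over tokens, and the per-token distillation losses are illustrative assumptions rather than the paper's exact objective.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def gate_weights(gaps, beta=1.0):
    """Map each token-level gap to a weight strictly between 0 and 1."""
    return [sigmoid(beta * g) for g in gaps]

def total_loss(rl_loss, distill_losses, gaps, lam=0.1, beta=1.0):
    """RL loss plus a gated token-level distillation term scaled by lam."""
    if not gaps or len(distill_losses) != len(gaps):
        raise ValueError("need one distillation loss per non-empty gap list")
    weights = gate_weights(gaps, beta)
    aux = sum(w * d for w, d in zip(weights, distill_losses)) / len(gaps)
    return rl_loss + lam * aux

# Hypothetical gaps: approved, neutral, and rejected tokens.
gaps = [2.0, 0.0, -2.0]
weights = gate_weights(gaps)
loss = total_loss(rl_loss=1.0, distill_losses=[0.5, 0.5, 0.5], gaps=gaps)
```

Note that the rejected token keeps a small nonzero weight instead of being dropped or turned into a negative example, and that the RL term is left untouched, which matches the "auxiliary, softly attenuated" role described above.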

Experimental Results

On Qwen2.5 and Qwen3 series models, SDAR consistently outperforms GRPO across three benchmarks:

Benchmark        Improvement over GRPO
ALFWorld         +9.4%
WebShop (Acc)    +10.2%
Search-QA        +7.0%

More importantly, SDAR avoids the instability of naive GRPO+OPSD combinations. The paper compares SDAR against multiple RL–OPSD hybrid baselines and demonstrates consistent superiority across model scales.

Why This Work Matters

Agent reinforcement learning is becoming an increasingly common paradigm for LLM post-training. Following GRPO, the community has been actively seeking more robust multi-turn training methods. SDAR’s contribution lies in identifying two previously overlooked challenges of OPSD in multi-turn settings, namely cumulative instability and teacher misjudgment, and in proposing a simple yet effective solution.

The gated auxiliary objective idea may hold broader relevance for RLHF/RLAIF scenarios: when external supervision signals are of uncertain quality, avoid letting them dominate training—instead, let them influence the main optimization process gently, via gating.


Primary Sources:

  • SDAR, arXiv:2605.15155
  • Authors: Zhengxi Lu, Zhiyuan Yao, Zhuowen Han, Zi-Han Wang, Jinyang Wu, Qi Gu, Xunliang Cai, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen