ChaoBro

Tencent's LPO: Unifying Group-Based RLVR Policy Gradients into a Single Geometric Framework


RLVR (Reinforcement Learning with Verifiable Rewards) is already the standard approach for LLM post-training: sample a group of responses per prompt, then update the policy with group-relative advantage signals. Most practitioners simply use these methods; few ask what they are actually doing geometrically.
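To make "group-relative advantage" concrete, here is a minimal sketch of the GRPO-style normalization: rewards for one prompt's group are centered on the group mean and scaled by the group standard deviation. The function name, the `eps` constant, and the reward values are illustrative assumptions, not taken from the paper.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Center and scale rewards within one prompt's group of sampled responses."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(var)
    # eps (an assumed constant) guards against a group where all rewards are equal.
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical verifier outcomes for 4 sampled responses to one prompt (1 = correct).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Correct responses receive a positive advantage and incorrect ones a negative advantage, and the advantages within a group sum to zero.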

Tencent Hunyuan's paper answers that question.

A Unified Geometric View

The paper's key finding is that existing group-based RLVR methods, whatever their names or formulas, share the same geometric structure: each implicitly defines a target distribution on the response simplex and projects the policy toward it via a first-order approximation.

In other words, everyone is doing the same thing, just without making it explicit.

Based on this insight, the authors propose LPO (Listwise Policy Optimization), which makes the target-projection explicit: restrict the proximal RL objective to the response simplex, then project the policy via exact divergence minimization.
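The two steps can be sketched on a toy group of four responses. This is an illustration under stated assumptions, not the paper's implementation: the "response simplex" is taken to mean distributions over the sampled group, the target is assumed to be the standard closed-form solution of a KL-regularized objective (q_i proportional to p_i * exp(A_i / beta)), and the divergence is assumed to be forward KL. The names `tilted_target`, `beta`, and all numeric values are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def tilted_target(policy_probs, advantages, beta=1.0):
    """Assumed target: q_i proportional to p_i * exp(A_i / beta).

    This is the maximizer of <q, A> - beta * KL(q || p) over the simplex.
    """
    logits = [math.log(p) + a / beta for p, a in zip(policy_probs, advantages)]
    return softmax(logits)

def kl(q, p):
    """KL(q || p) for two distributions on the same finite support."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

# Hypothetical sequence log-probs of 4 sampled responses under the current policy.
seq_logprobs = [-12.0, -11.5, -13.0, -12.5]
advantages = [1.0, -1.0, -1.0, 1.0]

p = softmax(seq_logprobs)                   # policy restricted to the sampled group
q = tilted_target(p, advantages, beta=1.0)  # explicit target on the simplex
gap = kl(q, p)                              # divergence the projection step would minimize
```

Under these assumptions the target shifts probability mass toward positively rewarded responses, and the projection step trains the policy to close `gap` rather than following a single first-order policy-gradient step.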

Two Key Properties

LPO provides monotonic improvement guarantees on the listwise objective, with bounded, zero-sum, self-correcting projection gradients. It also supports flexible divergence selection, with different divergences having distinct structural properties.
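One way to see why such gradients can be bounded, zero-sum, and self-correcting is the forward-KL case with a softmax parameterization. The gradient of KL(target || softmax(z)) with respect to the logits z is softmax(z) - target: its entries sum to zero, each lies in [-1, 1], and it vanishes once the policy matches the target. The sketch below assumes that setting; the target values, step size, and iteration count are hypothetical, and the paper's own divergences and proofs may differ.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def projection_gradient(logits, target):
    """Gradient of KL(target || softmax(logits)) w.r.t. logits: softmax(logits) - target."""
    probs = softmax(logits)
    return [p - q for p, q in zip(probs, target)]

target = [0.4, 0.1, 0.1, 0.4]  # hypothetical target distribution over 4 responses
logits = [0.0, 0.0, 0.0, 0.0]  # start from a uniform policy over the group

grad0 = projection_gradient(logits, target)

# Plain gradient descent on the group logits (step size 1.0 is an assumption).
for _ in range(200):
    g = projection_gradient(logits, target)
    logits = [z - 1.0 * gi for z, gi in zip(logits, g)]

final_probs = softmax(logits)
final_grad = projection_gradient(logits, target)
```

As the policy approaches the target the gradient shrinks toward zero, which is one plausible reading of "self-correcting": updates fade on their own instead of continuing to push probabilities toward the extremes.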

Across diverse reasoning tasks and LLM backbones, LPO consistently outperforms typical policy gradient baselines under matched targets, while intrinsically preserving optimization stability and response diversity.

Why It Matters

RLVR training instability is a well-known problem. Methods like GRPO and REINFORCE++ work, but they come with high tuning costs and training volatility. LPO unifies these methods from a geometric perspective and offers a more stable alternative.

If you have hands-on experience with LLM RL training, this paper's theoretical framework should help you see the essence of existing methods while providing a more stable training path.

Sources:

  • arXiv:2605.06139, "Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex", Yun Qu et al. (Tencent Hunyuan), May 2026