OPPO: Bayesian Value Recursion for Token-Level Credit Assignment in LLM Reasoning

GRPO's approach to token-level credit assignment is fundamentally blunt: it assigns the same trajectory-level advantage to every token in a response. Think of a company bonus distributed equally regardless of contribution.
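To make the "equal bonus" picture concrete, here is a minimal sketch of how a GRPO-style advantage ends up identical for every token in a response. The function name `grpo_token_advantages`, the `eps` constant, and the use of a population standard deviation are illustrative assumptions; real implementations differ in their normalization details.

```python
from statistics import mean, pstdev


def grpo_token_advantages(rewards, lengths, eps=1e-6):
    """Broadcast one group-normalized advantage per response to all of its tokens.

    rewards[i]: scalar reward of the i-th sampled response to the same prompt.
    lengths[i]: number of tokens in that response.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    per_response = [(r - mu) / (sigma + eps) for r in rewards]
    # Every token in a response receives the same value: no token-level credit.
    return [[a] * n for a, n in zip(per_response, lengths)]


# Four sampled responses to one prompt, with binary correctness rewards.
advantages = grpo_token_advantages(rewards=[1.0, 0.0, 0.0, 1.0], lengths=[3, 5, 2, 4])
for row in advantages:
    print([round(a, 3) for a in row])
```

Each printed row holds a single repeated number, so a pivotal step and a filler token are rewarded identically.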

OPPO (Oracle-Prompted Policy Optimization, arXiv:2605.21851, Yu Li et al., May 21, 2026) starts from a clean observation: the oracle signal that on-policy distillation methods use for local token discrimination can also be read as a Bayesian update of the model's belief about whether it will eventually succeed.

The core insight

When an LLM generates a reasoning chain, each token is essentially a bet on "am I heading toward the right answer?" Prior distillation methods evaluate each token in isolation. OPPO accumulates evidence along the trajectory, maintaining a running estimate of success probability at every position.

The math works out cleanly: accumulating the oracle signal yields a token-level advantage in closed form, with no learned value network and no additional rollouts. The only added cost is one extra forward pass.
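The accumulation idea can be sketched as a plain Bayes-rule recursion. This is an illustration under stated assumptions, not the paper's exact estimator: it treats the oracle-prompted forward pass as a proxy for p(token | success), uses a hypothetical prior of 0.5, and clips the belief to keep it a valid probability. The names `bayesian_token_advantages`, `logp_oracle`, and `logp_policy` are made up for this example.

```python
import math


def bayesian_token_advantages(logp_oracle, logp_policy, prior=0.5, eps=1e-6):
    """Run an illustrative Bayes-rule recursion on the belief in eventual success.

    logp_oracle[t]: log-prob of token t from the oracle-prompted forward pass
                    (assumed proxy for p(token | success, prefix)).
    logp_policy[t]: log-prob of the same token under the plain policy.
    Returns (beliefs, advantages) with advantages[t] = V_t - V_{t-1}.
    """
    belief = prior
    beliefs, advantages = [], []
    for lo, lp in zip(logp_oracle, logp_policy):
        ratio = math.exp(lo - lp)  # per-token likelihood ratio
        # Bayes update: P(success | y_<=t) = P(success | y_<t) * ratio, then clip.
        updated = min(max(belief * ratio, eps), 1.0 - eps)
        advantages.append(updated - belief)
        beliefs.append(updated)
        belief = updated
    return beliefs, advantages


# Toy numbers: token 0 is neutral, token 1 is favored by the oracle, token 2 is not.
logp_policy = [math.log(0.5), math.log(0.2), math.log(0.4)]
logp_oracle = [math.log(0.5), math.log(0.3), math.log(0.2)]
beliefs, advantages = bayesian_token_advantages(logp_oracle, logp_policy)
print([round(b, 3) for b in beliefs])
print([round(a, 3) for a in advantages])
```

In this toy run the neutral token gets zero advantage while the other two get credit of opposite sign, and the advantages telescope to the total change in belief. Both log-prob sequences come from forward passes over an already sampled response, so nothing here needs a value network or extra rollouts.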

A first-order analysis factorizes the advantage into two components: the per-token discrimination signal that distillation methods already use, and a state weight that concentrates credit on genuinely pivotal tokens. This gives a directional variance-reduction guarantee.
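One way such a factorization can arise is shown below. This is a hedged reconstruction, not a formula quoted from the paper: it assumes a multiplicative Bayes update of the success belief V_t, where o denotes the oracle prompt, ρ_t is the oracle-to-policy likelihood ratio for token y_t, and δ_t is its logarithm. All of these symbols are introduced here for illustration.

```latex
\begin{aligned}
V_t &= V_{t-1}\,\rho_t,
\qquad
\rho_t = \frac{\pi(y_t \mid x, o, y_{<t})}{\pi(y_t \mid x, y_{<t})},
\qquad
\delta_t = \log \rho_t, \\
A_t &= V_t - V_{t-1}
     = V_{t-1}\,\bigl(e^{\delta_t} - 1\bigr)
     \approx \underbrace{V_{t-1}}_{\text{state weight}}
       \cdot \underbrace{\delta_t}_{\text{discrimination signal}} .
\end{aligned}
```

Under this assumed form, the per-token log-ratio δ_t is the local signal that distillation-style methods already use, and the running belief scales how much that signal counts at each position. The paper's actual state weight may be defined differently.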

Two estimators

OPPO offers two flavors. The self-oracle reuses the student model — which recovers on-policy distillation as a strict special case. The teacher-oracle delegates scoring to a stronger frozen model, yielding better discrimination.

Results

Across two base LLMs and seven benchmarks spanning mathematics, science, and code reasoning, OPPO improves over GRPO, DAPO, and SDPO. The gains widen monotonically with response length — which makes sense: the longer the reasoning chain, the more valuable it becomes to know where the pivotal steps are.

Why it matters

DelTA, a concurrent method that takes a discriminator view on token-level rewards in RLVR, arrived on the same day as OPPO. The two attack token-level credit assignment from different angles. Neither trains a value network. Both show that GRPO's equal-credit-per-token assumption is leaving performance on the table.

Sources:

arXiv:2605.21851, OPPO: Bayesian Value Recursion for Token-Level Credit Assignment in LLM Reasoning, Yu Li et al., 2026-05-21
