Rednote's New Reasoning RL Approach: Don't Let the Student Imitate the Teacher—Let Them Diverge

Self-distillation is a well-known technique for training reasoning models: pull the model toward a privileged-context version of itself (e.g., one that has access to the verified answer).

That makes sense in theory, but in practice, results on math reasoning are inconsistent.

This paper from the Rednote (Xiaohongshu) team gets one thing right up front: it works out why self-distillation fails before proposing a fix.

The problem is the privileged context itself

Pointwise Mutual Information (PMI) analysis revealed a counterintuitive finding:

Privileged context (knowing the answer) causes two biases:

Over-confidence on structural tokens: connectives, verifiable claims—things derivable from the answer itself.
Under-confidence on deliberation tokens: "Wait", "Let", "Maybe"—the exploration tokens that drive multi-step reasoning.

In other words: when you show the model the answer, it stops thinking. The "let me think" steps become unnecessary when the answer is already known.

It's like giving someone the answers before an exam—they memorize the right answers but lose the thinking practice.
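To make the PMI comparison concrete, here is a minimal sketch of a per-token score that contrasts the model's log-probability with and without the privileged answer in context. The function name `token_pmi`, the example tokens, and all probabilities are made up for illustration; they are not taken from the paper, and the paper's exact PMI definition may differ.

```python
import math

def token_pmi(logp_with_answer, logp_without_answer):
    """Per-token PMI between each generated token and the privileged context.

    PMI(y_t; c) = log p(y_t | x, c, y_<t) - log p(y_t | x, y_<t)
    Positive: seeing the answer makes the token more likely.
    Negative: seeing the answer makes the token less likely.
    """
    return [w - wo for w, wo in zip(logp_with_answer, logp_without_answer)]

# Made-up probabilities for one short trace (illustrative only, not from the paper).
tokens = ["Wait", ",", "so", "x", "=", "4"]
p_with_answer = [0.05, 0.90, 0.60, 0.80, 0.95, 0.99]
p_without_answer = [0.30, 0.90, 0.50, 0.60, 0.90, 0.40]

pmi = token_pmi(
    [math.log(p) for p in p_with_answer],
    [math.log(p) for p in p_without_answer],
)

for tok, score in zip(tokens, pmi):
    if score > 0:
        note = "boosted by the answer"
    elif score < 0:
        note = "suppressed by the answer"
    else:
        note = "unchanged"
    print(f"{tok!r:8} PMI={score:+.2f}  {note}")
```

In this toy trace the deliberation token "Wait" gets a negative score (the answer-conditioned model is under-confident on it), while the answer-derivable token "4" gets a positive one, which mirrors the two biases described above.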

Anti-SD: Go the opposite direction

The solution is surprisingly simple: maximize divergence between student and teacher, not minimize it.

This flips the per-token gradient direction and produces a naturally bounded advantage in one step. An entropy-triggered gate then disables Anti-SD when the teacher's entropy collapses. The result is described as a drop-in replacement.
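The sign flip and the gate can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's formulation: the helper names, the log-ratio form of the signal, and the `entropy_floor` threshold are all hypothetical, and the sketch does not reproduce how the paper bounds the advantage or combines it with the GRPO objective.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def sd_token_signal(student_logp, teacher_logp):
    """Ordinary self-distillation direction (assumed form): pull toward the teacher."""
    return teacher_logp - student_logp

def anti_sd_token_signal(student_logp, teacher_logp, teacher_dist, entropy_floor=0.1):
    """Hypothetical Anti-SD term: the self-distillation signal with its sign flipped.

    The gate returns 0.0 when the teacher's next-token entropy falls below
    entropy_floor (an assumed threshold), i.e. when the teacher has collapsed.
    """
    if entropy(teacher_dist) < entropy_floor:
        return 0.0  # gate closed: Anti-SD disabled for this token
    return -sd_token_signal(student_logp, teacher_logp)

# Made-up numbers for a deliberation token such as "Wait".
student_logp = math.log(0.30)   # student without the answer
teacher_logp = math.log(0.05)   # same model with the answer in context

healthy_teacher = [0.05, 0.50, 0.45]   # teacher still spreads probability mass
collapsed_teacher = [0.999, 0.001]     # teacher entropy has collapsed

signal_open = anti_sd_token_signal(student_logp, teacher_logp, healthy_teacher)
signal_gated = anti_sd_token_signal(student_logp, teacher_logp, collapsed_teacher)

print(f"SD signal:          {sd_token_signal(student_logp, teacher_logp):+.2f}")
print(f"Anti-SD (gate open): {signal_open:+.2f}")
print(f"Anti-SD (gated off): {signal_gated:+.2f}")
```

With these numbers, ordinary self-distillation would push the student away from "Wait", while the flipped signal rewards it, and the gate zeroes the term once the teacher distribution is nearly deterministic.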

Results

Five models from 4B to 30B on math reasoning benchmarks:

Reaches GRPO baseline accuracy in 2-10x fewer training steps
Final accuracy improved by up to 11.5 points

GRPO has already proven effective in the DeepSeek-R1 lineage, so a speedup of this size on top of it is significant.

Why this matters

Reasoning capability is the competitive frontier in 2026. Anti-SD offers a path where a model bootstraps its own reasoning through its own training signal, which is valuable when no GPT-5-level teacher model is available.

Paper: Anti-Self-Distillation for Reasoning RL
Team: rednote-hilab (Xiaohongshu)
