Self-distillation for reasoning model training is well-known: pull the model toward a privileged-context version of itself (e.g., one that has access to the verified answer).
Makes sense in theory. But in practice on math reasoning, results are inconsistent.
This paper from the Rednote (Xiaohongshu) team does the useful thing first: it works out why self-distillation fails before proposing a fix.
The problem is the privileged context itself
A Pointwise Mutual Information (PMI) analysis revealed a counterintuitive finding.
Privileged context (knowing the answer) causes two biases:
- Over-confidence on structural tokens: connectives, verifiable claims—things derivable from the answer itself.
- Under-confidence on deliberation tokens: "Wait", "Let", "Maybe"—the exploration tokens that drive multi-step reasoning.
In other words: when you show the model the answer, it stops thinking. The "let me think" steps become unnecessary when the answer is already known.
It's like giving someone the answers before an exam—they memorize the right answers but lose the thinking practice.
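To make the PMI idea concrete, here is a minimal sketch. It assumes PMI is measured per token as the log-ratio of that token's probability with and without the privileged context; the paper's exact definition may differ. The function name `token_pmi` and all probabilities are made up for illustration and are not taken from the paper.

```python
import math

def token_pmi(logp_with_answer, logp_without_answer):
    """Per-token PMI between a token and the privileged context c.

    PMI(y_t; c) = log p(y_t | x, y_<t, c) - log p(y_t | x, y_<t)
    Positive: seeing the answer makes the token more likely.
    Negative: seeing the answer suppresses the token.
    """
    return [w - wo for w, wo in zip(logp_with_answer, logp_without_answer)]

# Hypothetical probabilities, chosen only to illustrate the two biases.
tokens = ["Wait", "so", "x", "=", "4"]
p_with = [0.02, 0.30, 0.60, 0.90, 0.95]     # model sees the verified answer
p_without = [0.20, 0.25, 0.50, 0.85, 0.40]  # model sees only the problem

pmi = token_pmi(
    [math.log(p) for p in p_with],
    [math.log(p) for p in p_without],
)
for tok, score in zip(tokens, pmi):
    print(f"{tok:>5}  PMI = {score:+.2f}")
```

With these toy numbers, the deliberation token "Wait" gets a strongly negative PMI and the answer token "4" a positive one, which is the pattern described above.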
Anti-SD: Go the opposite direction
The solution is surprisingly simple: maximize the divergence between student and teacher instead of minimizing it.
This flips the per-token gradient direction, producing a naturally bounded advantage in one step. On top of that comes an entropy-triggered gate that disables Anti-SD when the teacher's entropy collapses. Together they work as a drop-in replacement.
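A rough sketch of the sign flip and the gate, under stated assumptions: the per-token signal is taken to be a plain log-ratio between student and teacher, the gate is applied per token, and `min_entropy` is a hypothetical threshold. This is not the paper's formula, and it does not reproduce the bounded form of its advantage.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def anti_sd_signal(student_logp, teacher_logp, teacher_dists, min_entropy=0.1):
    """Per-token signal with the self-distillation sign flipped.

    Standard self-distillation would use teacher_logp - student_logp,
    pulling the student toward the teacher. Here the sign is reversed.
    Positions where the teacher's entropy is below `min_entropy`
    (a hypothetical threshold) are gated to zero.
    """
    signal = []
    for s, t, dist in zip(student_logp, teacher_logp, teacher_dists):
        gate_open = entropy(dist) >= min_entropy
        signal.append(s - t if gate_open else 0.0)
    return signal

# Toy example: two positions, the sampled token is index 0 in each distribution.
teacher_dists = [
    [0.02, 0.58, 0.40],  # teacher still uncertain: gate stays open
    [0.999, 0.001],      # teacher entropy collapsed: gate closes
]
teacher_logp = [math.log(0.02), math.log(0.999)]
student_logp = [math.log(0.20), math.log(0.40)]

signal = anti_sd_signal(student_logp, teacher_logp, teacher_dists)
print(signal)
```

At the first position the student puts more mass on the token than the answer-aware teacher does, so the flipped signal is positive. At the second position the teacher is nearly deterministic, so the gate zeroes the signal.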
Results
Across five models from 4B to 30B parameters on math reasoning benchmarks:
- Reaches GRPO baseline accuracy in 2-10x fewer training steps
- Improves final accuracy by up to 11.5 points
GRPO has already proven effective in the DeepSeek-R1 lineage, so a speedup of that size on top of it is significant.
Why this matters
Reasoning capability is the competitive frontier in 2026. Anti-SD offers a path where a model bootstraps its own reasoning through its training signal, which is valuable when you don't have a GPT-5-level teacher model available.
Paper: Anti-Self-Distillation for Reasoning RL
Team: rednote-hilab (Xiaohongshu)