ChaoBro

Anti-Self-Distillation: Inverse Self-Distillation Accelerates Reasoning RL Training by 2–10×

The finding in this paper is highly counterintuitive—yet, upon reflection, it makes perfect sense.

A Peculiar Failure Pattern

The paper begins with an observed phenomenon: on-policy self-distillation works well in some domains but proves unstable for mathematical reasoning.

The core idea of self-distillation is for a student model to learn from a copy of itself—one equipped with privileged context (e.g., verified solutions or feedback). No stronger external teacher is required; the model teaches itself.

Yet on mathematical reasoning tasks, this approach frequently fails.

PMI Analysis: The Problem Lies in the “Privileged Context” Itself

The team traced the root cause using pointwise mutual information (PMI) analysis. Privileged context inflates the teacher’s confidence in certain tokens: structural connectives and verifiable statements already implied by the solution. At the same time, it suppresses confidence in deliberation tokens such as “Wait”, “Let”, and “Maybe”, the words that drive multi-step search.

In short: showing the model the answer makes it more confident in writing answer-formatted tokens—but less willing to spend time “thinking”.
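The PMI comparison can be illustrated with a minimal sketch. It assumes the per-token PMI is the log-probability the context-equipped teacher assigns to a token minus the log-probability the plain student assigns to it; the paper's exact estimator may differ. The function name `token_pmi` and all probabilities below are hypothetical and chosen only to show the sign pattern, not taken from the paper.

```python
import math

def token_pmi(logp_with_context, logp_without_context):
    """Per-token PMI between a token and the privileged context.

    Assumed definition: PMI = log p(token | prefix, context) - log p(token | prefix).
    Positive values mean the context boosts the token; negative values mean
    the context suppresses it.
    """
    return [a - b for a, b in zip(logp_with_context, logp_without_context)]

# Hypothetical probabilities for illustration only (not results from the paper).
tokens = ["Wait", "Therefore", "Maybe", "="]
p_teacher = [0.02, 0.40, 0.01, 0.60]  # model copy that sees the verified solution
p_student = [0.10, 0.20, 0.05, 0.50]  # model that sees only the problem

pmi = token_pmi(
    [math.log(p) for p in p_teacher],
    [math.log(p) for p in p_student],
)

for tok, score in zip(tokens, pmi):
    direction = "boosted" if score > 0 else "suppressed"
    print(f"{tok!r}: PMI = {score:+.3f} ({direction} by privileged context)")
```

With these made-up numbers, the deliberation tokens come out with negative PMI and the structural tokens with positive PMI, which is the pattern the analysis describes.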

AntiSD: Doing the Opposite

Anti-Self-Distillation (AntiSD) takes a direct approach: if pushing the student toward the teacher harms performance, then push the student away from the teacher, increasing the student–teacher divergence rather than decreasing it.

Concretely, AntiSD flips the sign of each token’s advantage term, a single change that naturally yields a bounded advantage. It also adds an entropy-triggered gate that disables the term when the teacher’s entropy collapses. Together, these make AntiSD a drop-in replacement for standard self-distillation.
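A rough sketch of the sign flip and the entropy gate follows. It rests on several assumptions that the summary above does not confirm: that the self-distillation term for a sampled token is the teacher-minus-student log-probability, that the gate is a simple threshold on teacher entropy, and that the threshold `entropy_floor=0.1` is a reasonable placeholder. The names `anti_sd_term` and `entropy_floor` are hypothetical, and the sketch does not reproduce the bounded-advantage property.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def anti_sd_term(teacher_probs, student_probs, token_id, entropy_floor=0.1):
    """Hypothetical per-token AntiSD signal for one sampled token.

    Assumption: standard self-distillation rewards
    log p_teacher(token) - log p_student(token). Here the sign is flipped,
    so tokens the teacher favors are penalized and tokens it suppresses
    are rewarded.
    """
    # Assumed gate: switch the term off when the teacher distribution has collapsed.
    if entropy(teacher_probs) < entropy_floor:
        return 0.0
    sd_term = math.log(teacher_probs[token_id]) - math.log(student_probs[token_id])
    return -sd_term

# Illustrative distributions over a 3-token vocabulary (made-up values).
teacher = [0.7, 0.2, 0.1]
student = [0.5, 0.3, 0.2]
collapsed_teacher = [0.999, 0.0005, 0.0005]

print(anti_sd_term(teacher, student, token_id=0))            # teacher-favored token: negative
print(anti_sd_term(teacher, student, token_id=1))            # teacher-suppressed token: positive
print(anti_sd_term(collapsed_teacher, student, token_id=0))  # gate closed: 0.0
```

In a real training loop, a term like this would be combined with the existing RL objective; how the paper weights or combines it is not covered here.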

Performance Results

The numbers are compelling:

  • Evaluated across 5 models ranging from 4B to 30B parameters on mathematical reasoning benchmarks
  • AntiSD reaches the same accuracy as the GRPO baseline in 2–10× fewer training steps
  • Final accuracy improves by up to +11.5 points

Why This Matters

The paper’s core contribution goes beyond proposing a better training method. It exposes a fundamental contradiction in self-distillation for reasoning tasks: showing the model the answer may actually weaken its reasoning capability.

AntiSD opens a path toward scalable self-improvement, in which language models guide their own reasoning development using internal training signals. This could significantly change how reasoning models are trained.

Paper: arXiv:2605.11609