Knowledge Distillation is nothing new. Simply put, it involves having a large model (the teacher) teach a smaller model (the student), enabling the smaller model to approximate the capabilities of the larger one while maintaining a much smaller footprint.
However, this new paper from the Tencent Hunyuan team asks a question that hasn't been systematically studied before: Under an On-Policy setting, how much "locked" model efficiency can distillation actually unlock?
What is On-Policy Distillation?
Let's start with some background.
In reinforcement learning, On-Policy means the agent updates itself using only data generated by its current policy. Off-Policy methods, on the other hand, can also learn from data generated by other policies, such as earlier versions of the agent or a separate behavior policy.
The core idea of On-Policy Distillation is to run distillation on outputs the student model generates itself, with the teacher supplying the learning signal on those outputs, rather than on a fixed, pre-collected dataset.
This sounds intuitive—having a model learn in its "area of expertise" should be more efficient. But the real questions are: exactly how efficient? Under what conditions? And how large is the efficiency gap between different strategies?
Until now, no one has systematically answered these questions.
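To make the On-Policy versus Off-Policy distinction concrete, here is a minimal sketch on a toy three-token vocabulary. Everything in it is an assumption for illustration: the distributions, the function names, and the choice of losses (cross-entropy on a fixed teacher-generated dataset for the Off-Policy case, a sampled reverse KL for the On-Policy case). The article does not state which objective the paper actually uses.

```python
import math
import random

# Toy next-token distributions over a 3-token vocabulary (hypothetical values).
TEACHER = [0.7, 0.2, 0.1]
STUDENT = [0.5, 0.3, 0.2]


def sample_tokens(probs, n, rng):
    """Draw n token indices from a categorical distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=n)


def off_policy_loss(student, fixed_dataset):
    """Cross-entropy of the student on a fixed dataset (here: teacher samples)."""
    return -sum(math.log(student[t]) for t in fixed_dataset) / len(fixed_dataset)


def on_policy_loss(student, teacher, n, rng):
    """Monte Carlo estimate of KL(student || teacher) on the student's own samples."""
    own_samples = sample_tokens(student, n, rng)
    return sum(math.log(student[t]) - math.log(teacher[t]) for t in own_samples) / n


rng = random.Random(0)
# Off-policy: the dataset is generated once and never refreshed.
fixed_dataset = sample_tokens(TEACHER, 20000, rng)
print("off-policy loss:", off_policy_loss(STUDENT, fixed_dataset))
# On-policy: the data comes from the student as it is right now.
print("on-policy loss:", on_policy_loss(STUDENT, TEACHER, 20000, rng))
```

The only structural difference between the two functions is where the tokens come from: `off_policy_loss` receives a dataset that does not change as the student changes, while `on_policy_loss` draws fresh samples from whatever student it is given.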
Key Findings of the Paper
The Tencent Hunyuan team conducted extensive experiments, yielding several noteworthy findings:
First, the efficiency advantage of On-Policy distillation is not uniform. On certain tasks, On-Policy distillation shows a significantly greater efficiency boost compared to Off-Policy distillation; on others, the gap is minimal. This indicates that the choice of distillation strategy must be tailored to the specific task characteristics—there is no "one-size-fits-all" solution.
Second, the key to "unlocking efficiency" lies in how well the data distributions match. When the distribution of the distillation data closely aligns with the distribution the model encounters in real-world usage, the advantages of On-Policy distillation are maximized. This makes intuitive sense: the closer your practice material is to the actual exam, the better your results will be.
Third, iterative distillation outperforms one-shot distillation. The paper found that progressively distilling through multiple rounds (where each round uses the updated model to generate new distillation data) can continuously unlock model potential. This process is akin to "self-improvement"—each round builds upon and surpasses the previous one.
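The multi-round idea in the third finding can be sketched as a loop in which every round samples fresh data from the current student before updating it. This is a toy under stated assumptions, not the paper's training recipe: the student is a single softmax over three tokens, the update is a score-function gradient step on reverse KL, and the names `distill_round` and `iterative_distillation`, along with the learning rate and sample counts, are hypothetical.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student, teacher):
    """Exact KL(student || teacher), used only to monitor progress."""
    return sum(p * math.log(p / q) for p, q in zip(student, teacher) if p > 0)


def distill_round(logits, teacher, n_samples, lr, rng):
    """One round: sample from the current student, then take one gradient step."""
    probs = softmax(logits)
    samples = rng.choices(range(len(probs)), weights=probs, k=n_samples)
    grad = [0.0] * len(logits)
    for tok in samples:
        # Score-function estimate of the reverse-KL gradient w.r.t. the logits.
        signal = math.log(probs[tok]) - math.log(teacher[tok])
        for i in range(len(logits)):
            indicator = 1.0 if i == tok else 0.0
            grad[i] += signal * (indicator - probs[i]) / n_samples
    return [x - lr * g for x, g in zip(logits, grad)]


def iterative_distillation(teacher, rounds=20, n_samples=2000, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * len(teacher)  # the student starts as a uniform distribution
    history = [reverse_kl(softmax(logits), teacher)]
    for _ in range(rounds):
        # Each round regenerates data from the *updated* student.
        logits = distill_round(logits, teacher, n_samples, lr, rng)
        history.append(reverse_kl(softmax(logits), teacher))
    return history


history = iterative_distillation([0.7, 0.2, 0.1])
print("KL before:", history[0], "KL after:", history[-1])
```

A one-shot variant would call `distill_round` repeatedly on samples drawn once from the initial student; the loop above differs only in that the sampling distribution moves together with the model.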
Implications for the Industry
The value of this paper extends beyond academic discovery; it also offers practical guidance for engineering teams.
Cost Optimization. Training costs for large models keep rising, so any method that improves training efficiency has direct economic value. If On-Policy distillation can reduce the number of training steps while maintaining equivalent performance, the saved compute and time translate into tangible savings.
Unlocking Small Model Capabilities. In many scenarios, deploying the largest model is impractical due to cost, latency, or deployment constraints. Distillation is a key technology for enabling smaller models to approach the capabilities of larger ones. Understanding the efficiency boundaries of On-Policy distillation helps us make more precise trade-offs between "model size" and "performance."
RLHF Pipeline Optimization. The concept of On-Policy distillation shares similarities with PPO training in RLHF—both rely on generating data from the current policy to update the model. The findings in this paper could provide valuable references for optimizing RLHF workflows.
Caveats and Considerations
Of course, it's important to maintain a level-headed perspective on these results:
Limitations in Experimental Scope. The paper's conclusions are based on specific model architectures and task settings. Changing the model or the domain may mean the conclusions don't fully apply.
Trade-offs in Computational Overhead. While On-Policy distillation can improve efficiency, each round requires the model to generate data, which itself incurs computational costs. In practical applications, a comprehensive evaluation of "distillation gains" versus "generation costs" is necessary.
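The "distillation gains" versus "generation costs" comparison can be written down as simple arithmetic. The sketch below assumes a deliberately crude cost model: every training step has the same cost, On-Policy adds a per-step generation cost, and Off-Policy may pay a one-time cost to build its dataset. The function names and all numbers are hypothetical; the article reports no such figures.

```python
def total_cost(steps, train_cost_per_step, gen_cost_per_step=0.0, one_time_cost=0.0):
    """Total compute under a flat per-step cost model (arbitrary units)."""
    return one_time_cost + steps * (train_cost_per_step + gen_cost_per_step)


def on_policy_is_cheaper(on_steps, off_steps, train_cost, gen_cost, dataset_cost=0.0):
    """Compare total cost to reach the same quality with each strategy."""
    on_total = total_cost(on_steps, train_cost, gen_cost_per_step=gen_cost)
    off_total = total_cost(off_steps, train_cost, one_time_cost=dataset_cost)
    return on_total < off_total


def break_even_gen_cost(on_steps, off_steps, train_cost, dataset_cost=0.0):
    """Largest per-step generation cost at which on-policy still breaks even."""
    return (dataset_cost + off_steps * train_cost) / on_steps - train_cost


# Hypothetical scenario: on-policy needs 400 steps, off-policy needs 1000.
print(on_policy_is_cheaper(400, 1000, train_cost=1.0, gen_cost=1.0))
print(break_even_gen_cost(400, 1000, train_cost=1.0))
```

Under this model, needing fewer steps is not enough on its own: once the per-step generation cost exceeds the break-even value, the On-Policy run costs more overall despite converging in fewer steps.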
Risk of Overfitting. Repeatedly learning from data generated by the model itself may cause it to over-adapt to specific data distributions, thereby reducing generalization capabilities. While the paper mentions several mitigation strategies, careful validation remains essential during actual deployment.
Final Thoughts
The greatest contribution of this Tencent Hunyuan paper may not be a specific technical metric, but rather bringing a previously overlooked question to the forefront.
Over the past few years, the industry has been chasing "larger models, more data, and greater compute." But this paper serves as a reminder: efficiency matters just as much. If you can achieve the same results with fewer resources, that in itself is a competitive advantage.
In 2026, as compute costs continue to climb and the industry focuses more on return on investment, systematic research into "efficiency" of this kind is a sign that the sector is maturing.