ChaoBro

Tsinghua ZEDA: Skip Half the Experts in Pre-trained MoE Models via Self-Distillation, Boosting Inference Speed by 1.2x

Mixture of Experts (MoE) has become a standard architecture for large language models. However, MoE has an awkward limitation: once training ends, the architecture is fixed. The total number of experts is static, and the number of experts activated per token is predetermined.

This means that even if a user asks a simple question like "What is 1+1?", the model still activates the same number of experts and consumes the same amount of compute.

The Tsinghua team's new work, ZEDA (Zero-Expert Self-Distillation Adaptation), aims to solve this inefficiency.

From Static to Dynamic: Teaching the Model to "Be Lazy"

The core idea is elegant: inject "zero-output experts" into a pre-trained MoE model—these experts do nothing and consistently output zero. Then, through self-distillation, the model learns to delegate simple tasks to these zero experts while reserving real experts for complex tasks.
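A minimal sketch of this routing idea, written in plain Python so it runs without dependencies. The function name `moe_forward`, the slot layout (zero experts appended after the real ones), and the use of un-renormalized softmax weights are assumptions made for illustration, not details taken from ZEDA's code.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, router_logits, experts, num_zero, top_k):
    """Route one token through an MoE layer with appended zero experts.

    x: token hidden vector (list of floats)
    router_logits: one logit per slot; the last `num_zero` slots are zero experts
    experts: callables for the real experts
    Returns (output vector, number of real experts actually executed).
    """
    assert len(router_logits) == len(experts) + num_zero
    probs = softmax(router_logits)
    # Pick the top_k slots by router probability (assumed top-k routing).
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    out = [0.0] * len(x)
    real_calls = 0
    for i in chosen:
        if i >= len(experts):
            continue  # zero expert: adds nothing to the output, so no compute is spent
        y = experts[i](x)
        out = [o + probs[i] * v for o, v in zip(out, y)]
        real_calls += 1
    return out, real_calls

# Two toy real experts plus two zero-expert slots (hypothetical values).
experts = [lambda x: [2.0 * v for v in x], lambda x: [v + 1.0 for v in x]]
_, easy_calls = moe_forward([1.0, 2.0], [0.0, 0.0, 5.0, 5.0], experts, num_zero=2, top_k=2)
_, hard_calls = moe_forward([1.0, 2.0], [5.0, 4.0, 0.0, 0.0], experts, num_zero=2, top_k=2)
print(easy_calls, hard_calls)  # 0 2
```

With the same top_k, the "easy" token lands entirely on zero-expert slots and runs no real expert, while the "hard" token runs two. The per-token compute varies even though the routing rule itself is unchanged.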

It sounds simple, but implementation involves three main challenges:

1. Stability of Architecture Transition

Suddenly adding a batch of zero experts to a trained model would confuse it. The original routing weights were trained on a fixed number of experts, and changing the architecture would completely disrupt routing behavior.

ZEDA's solution is a two-stage self-distillation process:

  • Stage 1: Use the original MoE as a frozen teacher, allowing the new model to learn to maintain its original behavior
  • Stage 2: Introduce a group-level balancing loss to ensure load balance across experts, preventing all tokens from flooding into the zero experts
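The two stages can be sketched as a loss function. The exact form of ZEDA's group-level balancing loss is not given here, so the sketch substitutes a generic load-balancing penalty in the style used by earlier MoE work, computed over expert groups. The names `kl_divergence`, `group_balance_loss`, `distill_loss`, and the weight `lam` are all hypothetical.

```python
import math

def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) for one token's next-token distribution."""
    return sum(t * math.log((t + eps) / (s + eps))
               for t, s in zip(teacher_probs, student_probs) if t > 0)

def group_balance_loss(group_load, group_prob):
    """Generic load-balancing penalty over expert groups (an assumed stand-in).

    group_load[g]: fraction of routed tokens sent to group g
    group_prob[g]: mean router probability mass on group g
    Equals 1.0 when both are uniform and grows as routing concentrates.
    """
    n = len(group_load)
    return n * sum(f * p for f, p in zip(group_load, group_prob))

def distill_loss(teacher_probs, student_probs, group_load, group_prob, stage, lam=0.01):
    # Stage 1: match the frozen teacher only.
    loss = kl_divergence(teacher_probs, student_probs)
    # Stage 2: additionally penalize routing that collapses onto one group.
    if stage == 2:
        loss += lam * group_balance_loss(group_load, group_prob)
    return loss

print(group_balance_loss([0.5, 0.5], [0.5, 0.5]))  # 1.0 (balanced)
print(group_balance_loss([1.0, 0.0], [1.0, 0.0]))  # 2.0 (collapsed onto one group)
```

The collapsed case is the failure mode the second stage guards against: if every token flows to the zero experts, the penalty term rises even when the distillation term looks fine.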

2. Design of Zero Experts

Zero experts aren't just hardcoded constant outputs. ZEDA injects parameterized zero experts—initialized to output zero, but capable of gradually "waking up" during training. This allows the model to dynamically determine the required compute based on task difficulty.
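One common way to get a module that starts out producing zero yet remains trainable is to zero-initialize its output projection. The sketch below shows that technique on a tiny feed-forward expert. Whether ZEDA parameterizes its zero experts this way is an assumption, and the class `ZeroInitExpert` and its methods are hypothetical.

```python
import random

class ZeroInitExpert:
    """Tiny ReLU feed-forward expert whose output projection starts at zero."""

    def __init__(self, d_model, d_hidden, seed=0):
        rng = random.Random(seed)
        # Input projection gets a small random init; output projection is all zeros.
        self.w_in = [[rng.gauss(0.0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
        self.w_out = [[0.0] * d_hidden for _ in range(d_model)]

    def hidden(self, x):
        return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in self.w_in]

    def __call__(self, x):
        h = self.hidden(x)
        return [sum(w * v for w, v in zip(row, h)) for row in self.w_out]

    def sgd_step(self, x, grad_out, lr=0.1):
        """One SGD step on w_out only: dL/dw_out[i][j] = grad_out[i] * h[j]."""
        h = self.hidden(x)
        for i, g in enumerate(grad_out):
            for j, hj in enumerate(h):
                self.w_out[i][j] -= lr * g * hj

expert = ZeroInitExpert(d_model=4, d_hidden=8)
x = [1.0, -2.0, 0.5, 3.0]
print(expert(x))                   # all zeros at initialization
expert.sgd_step(x, [1.0] * 4)      # hypothetical upstream gradient
print(expert(x))                   # moves away from zero wherever a hidden unit is active
```

Because the hidden activations are generally nonzero, the output projection receives a gradient from the first step onward, which is what lets such an expert "wake up" if training pushes it to.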

3. Adaptive Routing Strategy

The router in a dynamic MoE needs to learn to select varying numbers of experts for different inputs. ZEDA enables the router to automatically acquire this capability through reward signals during self-distillation—without requiring additional labeled data.

Empirical Results: Halving Compute with Negligible Accuracy Drop

Tests on the Qwen3-30B-A3B and GLM-4.7-Flash models show:

  • Eliminates over 50% of expert FLOPs—for simple tasks, most tokens activate only a minimal number of experts
  • Negligible accuracy loss—across 11 benchmarks covering math, coding, instruction following, etc., performance degradation remains within acceptable limits
  • Approximately 1.2x end-to-end inference speedup—considering this is merely a post-training adaptation, the speed improvement is quite substantial
  • Outperforms the strongest dynamic MoE baselines by 6.1 points on Qwen3-30B-A3B and 4.0 points on GLM-4.7-Flash
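The gap between "over 50% of expert FLOPs removed" and "about 1.2x end-to-end" is what an Amdahl-style estimate would predict, since only the expert portion of the runtime shrinks. The sketch below is a back-of-the-envelope check, not a measurement. It assumes runtime scales linearly with FLOPs, which real memory-bound decoding does not strictly obey, and both function names are hypothetical.

```python
def end_to_end_speedup(expert_time_share, expert_flops_removed):
    """Amdahl-style estimate: only the expert share of the runtime shrinks."""
    remaining = (1.0 - expert_time_share) + expert_time_share * (1.0 - expert_flops_removed)
    return 1.0 / remaining

def implied_expert_share(speedup, expert_flops_removed):
    """Invert the estimate: which expert time share makes the two figures agree?"""
    return (1.0 - 1.0 / speedup) / expert_flops_removed

share = implied_expert_share(1.2, 0.5)
print(round(share, 3))                           # 0.333
print(round(end_to_end_speedup(share, 0.5), 3))  # 1.2
```

Under these assumptions, the two reported figures are consistent if expert computation accounts for roughly a third of end-to-end inference time, with attention, routing, memory traffic, and other overhead making up the rest.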

Why This Matters More Than It Sounds

MoE inference cost is currently one of the core bottlenecks in the commercial deployment of large models. Leading players like Anthropic and OpenAI are widely believed to be pursuing the same goal: "do more with fewer active parameters."

ZEDA's unique value lies in the fact that it requires no training from scratch. In principle, existing open-source MoE models like Qwen3 and GLM-4.7 can be adapted with ZEDA directly, a "slimming" pass that yields inference acceleration after a short round of self-distillation.

This is particularly attractive for small-to-medium inference service providers—they don't need to invest tens of millions in training costs. Just a few days of self-distillation training can yield significant cost optimizations.

A Grounded Perspective

Of course, there are caveats to keep in mind:

  • The FLOP reduction is "over 50%", not exactly 50%: the actual ratio depends on the input distribution. Savings grow when the workload is dominated by simple tasks and shrink when it is dominated by complex ones
  • The 1.2x end-to-end speedup isn't massive in absolute terms, but for a post-training adaptation that needs no training from scratch, it is a solid result
  • The GitHub repository currently has only 5 stars, and the code is likely still being organized
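The first caveat can be made concrete with a weighted average. The skip rates below are illustrative placeholders, not measured values from ZEDA, and the function name `expected_flop_reduction` is hypothetical.

```python
def expected_flop_reduction(mix, skip_rate):
    """Workload-weighted share of expert FLOPs skipped.

    mix: {category: fraction of tokens}, must sum to 1
    skip_rate: {category: fraction of expert FLOPs skipped for that category}
    """
    assert abs(sum(mix.values()) - 1.0) < 1e-9
    return sum(mix[c] * skip_rate[c] for c in mix)

# Illustrative numbers only: simple tokens skip far more experts than complex ones.
skip_rate = {"simple": 0.8, "complex": 0.3}
mostly_complex = expected_flop_reduction({"simple": 0.4, "complex": 0.6}, skip_rate)
mostly_simple = expected_flop_reduction({"simple": 0.7, "complex": 0.3}, skip_rate)
print(round(mostly_complex, 2), round(mostly_simple, 2))  # 0.5 0.65
```

The same adapted model yields different savings on different traffic, so a provider would need to measure the reduction on its own request mix rather than rely on the headline figure.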

Primary Sources: