What Scaling Laws Won’t Tell You: The Bigger the Model, The Weirder the Bugs
Scaling Laws tell us that model capability improves steadily as parameters and data grow. What Scaling Laws don’t tell you is that once model scale crosses a certain threshold, serving starts producing probabilistic, extremely hard-to-reproduce garbled outputs.
Zhipu AI (THUDM) published a technical blog post on April 29 titled “Scaling Pain: Debugging GLM-5 Serving at Scale,” detailing their experience debugging large-scale inference issues with GLM-5. The post received 843 likes and 295 bookmarks, sparking widespread discussion in the community.
The Problem: Sporadic Garbled Output, Only at Scale
GLM-5 is a 744B parameter MoE model. On a single machine or a small cluster, everything works fine. But when deployed to a production-grade distributed cluster, the team encountered a bizarre issue:
Garbled text occasionally appeared in outputs, but the errors were extremely rare and hard to reproduce.
This wasn’t a common encoding issue or tokenization error: it appeared only under specific distributed serving configurations, and even then only with some probability. The team spent significant effort building a reliable reproduction pipeline.
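To make that concrete, here is a minimal sketch of what such a reproduction harness can look like, assuming an OpenAI-compatible HTTP completions endpoint in front of the deployed model. The endpoint URL, model id, and the garbled-output heuristic are illustrative assumptions, not details from Zhipu’s post.

```python
import requests

ENDPOINT = "http://localhost:8000/v1/completions"  # hypothetical serving endpoint
PROMPT = "Explain the CAP theorem in one paragraph."  # fixed prompt for determinism

def looks_garbled(text: str) -> bool:
    # Crude heuristic: Unicode replacement characters or stray control bytes.
    return "\ufffd" in text or any(ord(c) < 9 for c in text)

def run_once(seed: int) -> str:
    resp = requests.post(ENDPOINT, json={
        "model": "glm-5",        # hypothetical model id
        "prompt": PROMPT,
        "max_tokens": 256,
        "temperature": 0.0,      # greedy decoding to remove sampling noise
        "seed": seed,
    }, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]

def hunt(trials: int = 10_000) -> None:
    # Rare failures only surface at volume, so hammer the same request many times.
    for i in range(trials):
        out = run_once(seed=1234)
        if looks_garbled(out):
            print(f"trial {i}: suspicious output: {out[:120]!r}")

if __name__ == "__main__":
    hunt()
```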
Debugging Methodology
Zhipu’s team shared a three-step debugging framework in their blog:
| Step | Method | Output |
|---|---|---|
| Reproduce | Build deterministic test cases and force-trigger the failure with specific seeds | Reproducible garbled output samples |
| Locate | Check tensor communication layer by layer in the distributed inference pipeline (sketch below) | Numerical drift between specific nodes |
| Fix | Adjust mixed precision strategy, introduce numerical stability guards | Garbled outputs eliminated, no performance loss |
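The Locate step above is usually the most labor-intensive. Below is a minimal sketch of one way to approach it, assuming you can run the same input through a trusted single-GPU reference model and the sharded production model and capture per-layer outputs; the hook mechanism is standard PyTorch, but the tolerance and layer naming are assumptions.

```python
import torch

def capture_activations(model, input_ids):
    """Run one forward pass and record each module's output tensor."""
    acts, hooks = {}, []

    def make_hook(name):
        def hook(_module, _inputs, output):
            if isinstance(output, torch.Tensor):
                acts[name] = output.detach().float().cpu()
        return hook

    for name, module in model.named_modules():
        hooks.append(module.register_forward_hook(make_hook(name)))
    with torch.no_grad():
        model(input_ids)
    for h in hooks:
        h.remove()
    return acts

def report_drift(ref_acts, dist_acts, atol=1e-2):
    # Walk the layers in order; the first layer whose outputs diverge beyond
    # the tolerance is where the numerical problem is introduced.
    for name, ref in ref_acts.items():
        other = dist_acts.get(name)
        if other is None or other.shape != ref.shape:
            continue
        diff = (ref - other).abs().max().item()
        if diff > atol:
            print(f"{name}: max abs diff {diff:.4e}")
```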
The key finding: in large-scale MoE inference, inconsistent numerical precision across different experts can accumulate to a degree that affects output quality. This is especially pronounced under high concurrency.
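The effect is easy to demonstrate in isolation. The toy example below (not GLM-5 code) sums the same set of weighted expert outputs once in FP32 and once with BF16 accumulation; the relative error of the low-precision path is the kind of drift that, layered across dozens of MoE blocks, can eventually flip a token.

```python
import torch

torch.manual_seed(0)
num_experts, hidden = 128, 4096
expert_outputs = torch.randn(num_experts, hidden)          # FP32 reference outputs
weights = torch.softmax(torch.randn(num_experts), dim=0)   # router weights

ref = (weights[:, None] * expert_outputs).sum(dim=0)       # accumulate in FP32

acc = torch.zeros(hidden, dtype=torch.bfloat16)
for w, out in zip(weights.bfloat16(), expert_outputs.bfloat16()):
    acc = acc + w * out                                     # accumulate in BF16

rel_err = ((acc.float() - ref).norm() / ref.norm()).item()
print(f"relative error of BF16 accumulation: {rel_err:.2e}")
```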
Why This Matters
This blog is valuable because it is one of the few first-hand accounts of large-model serving “Scaling Pain.” The industry is flooded with discussions of “model capabilities,” but write-ups on “how to make a 744B MoE model run stably in production” are scarce.
For enterprises and developers considering self-hosting domestically developed large models, this information is highly actionable:
- Don’t assume single-machine tests passing means production-ready: Distributed inference introduces entirely new failure modes
- Numerical stability is a hidden challenge for MoE: Under expert parallelism, precision drift between different GPUs gets amplified (see the stability-guard sketch after this list)
- Building deterministic reproduction is more effective than blind tuning: Zhipu’s first step was building reproducible test cases, not modifying model code
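As a concrete illustration of the stability-guard idea referenced above, here is a minimal sketch that forces the expert-combine step to accumulate in FP32 even when the surrounding model runs in BF16. The module structure and tensor shapes are assumptions for illustration, not GLM-5’s actual implementation.

```python
import torch
import torch.nn as nn

class StableMoECombine(nn.Module):
    """Combine top-k expert outputs with FP32 accumulation, then cast back."""

    def forward(self, expert_outputs: torch.Tensor, router_weights: torch.Tensor) -> torch.Tensor:
        # expert_outputs: [tokens, top_k, hidden], typically BF16
        # router_weights: [tokens, top_k]
        out32 = expert_outputs.float()               # upcast before combining
        w32 = router_weights.float().unsqueeze(-1)
        combined = (w32 * out32).sum(dim=1)          # accumulate in FP32
        return combined.to(expert_outputs.dtype)     # downstream layers still see BF16
```

The cost is a few extra casts per MoE block, which is usually negligible next to the expert GEMMs; that is consistent with the blog’s claim of eliminating the garbled outputs without measurable performance loss.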
Action Items
If you plan to deploy GLM-5.1 or similarly sized domestically developed MoE models in production:
- Stress-test before going live: Simulate production-level concurrency and watch for sporadic garbled outputs
- Monitor numerical precision: Check activation value distributions across different GPU nodes (see the monitoring sketch after this list)
- Reference Zhipu’s mixed precision strategy: Their approach of using FP32 instead of BF16 for certain layers is a practical reference point
- Follow THUDM updates: The fix has been merged into GLM-5’s open-source code
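For the monitoring item above, a minimal sketch of per-rank activation-statistics logging is shown below, assuming a PyTorch model running under torch.distributed; which layers to watch, how often to sample, and where to ship the stats are deployment-specific assumptions.

```python
import torch
import torch.distributed as dist

def attach_activation_monitors(model, layer_keyword="mlp"):
    """Log per-rank activation statistics; divergence across ranks for the same
    layer points at numerical drift or a broken collective."""
    rank = dist.get_rank() if dist.is_initialized() else 0

    def make_hook(name):
        def hook(_module, _inputs, output):
            if isinstance(output, torch.Tensor):
                t = output.detach().float()
                print(f"rank={rank} layer={name} "
                      f"mean={t.mean().item():+.4f} std={t.std().item():.4f} "
                      f"absmax={t.abs().max().item():.2f}")
        return hook

    handles = []
    for name, module in model.named_modules():
        if layer_keyword in name:
            handles.append(module.register_forward_hook(make_hook(name)))
    return handles  # call .remove() on each handle to detach the monitors
```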
GLM-5.1 (released March 27) is already a mature version, reaching 94-95% of Claude Opus 4.6’s level on SWE-Bench. This blog is more of a pitfall guide for those who come after: engineering experience distilled from the pain of scaling.