Hugging Face released a tiny model called nanowhale. 110M parameters. Less than one-thousandth the size of DeepSeek-V4.
But don't skip past it too quickly. It replicates nearly all of DeepSeek-V4's key architectural components: MLA (Multi-Head Latent Attention), MoE (Mixture of Experts), Hyper-Connections, MTP (Multi-Token Prediction), SwiGLU activation, and RoPE + NoPE positional encoding.
In other words, this is a microscope for DeepSeek-V4's architecture.
Why you need a model this small
The most expensive thing in LLM research isn't training. It's debugging.
You tweak one attention hyperparameter and want to run an ablation on a 671B model? Not realistic. Just loading the model requires dozens of H100s, and each experimental run takes days.
nanowhale's approach is simple: keep DeepSeek-V4's architectural skeleton intact, but scale every dimension down to the minimum. 110M parameters means you can run inference on consumer-grade GPUs or even CPUs, and train a round in minutes.
This is not a QA model. You can't use it to write code or translate. It has exactly two use cases:
Architecture research. Want to see what happens when you double the MoE expert count (256 to 512 at DeepSeek-V4 scale, 8 to 16 here)? Want to verify whether MLA still works at small parameter scales? Use nanowhale.
Education. The open source community has lacked a teaching model that is small enough to run anywhere yet carries a full modern architecture. GPT-2 is too old and lacks both MoE and MLA. With nanowhale, students can run the complete DeepSeek-V4 architecture training flow in a notebook.
How closely does it replicate DeepSeek-V4
nanowhale maps onto DeepSeek-V4 essentially component by component:
| DeepSeek-V4 Component | nanowhale Implementation |
|---|---|
| MLA (128 heads, latent dim 128) | MLA (4 heads, latent dim 16) |
| 256 MoE experts (top-8 routing) | 8 MoE experts (top-2 routing) |
| Hyper-Connections (residual replacement) | Hyper-Connections (same structure) |
| MTP (multi-token prediction) | MTP (predicts 2 subsequent tokens) |
| SwiGLU FFN | SwiGLU FFN |
| RoPE + NoPE | RoPE + NoPE |
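To make the scaling in the table concrete, the sketch below records the table's numbers in a small config object and computes how much each dimension shrinks. The `ArchConfig` class and its field names are hypothetical, chosen for illustration; they are not taken from nanowhale's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchConfig:
    """Hypothetical container; field names are illustrative only."""
    name: str
    attn_heads: int
    latent_dim: int
    num_experts: int
    top_k: int

    @property
    def active_expert_fraction(self) -> float:
        # Share of experts that each token is routed to.
        return self.top_k / self.num_experts


# Values copied from the comparison table.
deepseek_v4 = ArchConfig("DeepSeek-V4", attn_heads=128, latent_dim=128,
                         num_experts=256, top_k=8)
nanowhale = ArchConfig("nanowhale", attn_heads=4, latent_dim=16,
                       num_experts=8, top_k=2)


def shrink_factors(big: ArchConfig, small: ArchConfig) -> dict:
    """Return how many times smaller each dimension is in `small`."""
    fields = ("attn_heads", "latent_dim", "num_experts", "top_k")
    return {f: getattr(big, f) / getattr(small, f) for f in fields}


if __name__ == "__main__":
    for field, factor in shrink_factors(deepseek_v4, nanowhale).items():
        print(f"{field}: {factor:g}x smaller")
    for cfg in (deepseek_v4, nanowhale):
        print(f"{cfg.name}: {cfg.active_expert_fraction:.1%} of experts active per token")
```

With the table's values, head count and expert count shrink 32x, the latent dimension 8x, and top-k 4x. One consequence is that each token activates 2 of 8 experts (25%) in nanowhale versus 8 of 256 (about 3%) in the full-size column, so routing is far less sparse at the small scale.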
Proportions are scaled, but topology is unchanged. This means architectural behavior changes observed in nanowhale, such as how a different routing strategy affects training stability, can likely carry over to larger models.
Of course, "likely" doesn't mean "certainly." The nonlinearity of scaling laws is well known, and findings on small models need re-verification on larger ones. But as a first-round screening tool, it's good enough.
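For readers new to MoE routing, here is a minimal sketch of generic top-k gating with softmax scores, using the 8-expert, top-2 setup from the table. It assumes a plain linear router; the function names are illustrative, and the real router in nanowhale or DeepSeek-V4 may score and normalize experts differently.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(token, router_weights, k=2):
    """Pick the k highest-scoring experts for one token.

    token: hidden vector as a list of floats
    router_weights: one weight vector per expert (assumed linear router)
    Returns [(expert_index, gate_weight)], best expert first, with the
    gate weights renormalized to sum to 1 over the chosen experts.
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the demo is repeatable
    hidden_dim, num_experts = 16, 8
    router = [[rng.gauss(0, 1) for _ in range(hidden_dim)]
              for _ in range(num_experts)]
    token = [rng.gauss(0, 1) for _ in range(hidden_dim)]
    for expert, gate in route_top_k(token, router, k=2):
        print(f"expert {expert}: gate weight {gate:.3f}")
```

Changing `k`, the scoring function, or the renormalization step is the kind of routing-strategy change that is cheap to test at this scale.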
What the community is already doing
nanowhale has been on Hugging Face for less than 24 hours, and three derivative directions have already emerged:
- Quantization experiments. Someone is testing the impact of INT4 quantization on MoE routing precision, an experiment that is too expensive to run on large models.
- Architecture ablation. Someone replaced Hyper-Connections with traditional residuals and compared the training curves directly.
- Teaching notebooks. Two grad students have already run nanowhale's full training flow from scratch on Colab, using under 2 hours of GPU time.
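The quantization experiment in the first bullet can be approximated in a few lines. This sketch assumes symmetric per-row INT4 quantization of a linear router's weights and measures how often the top-2 expert choice survives the round trip. The random weights and tokens are stand-ins, not nanowhale's parameters, and the community experiments may use a different quantization scheme.

```python
import random


def quantize_int4(row):
    """Symmetric per-row INT4: integers in [-8, 7] plus one float scale."""
    max_abs = max(abs(w) for w in row)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in row]
    return q, scale


def dequantize(q, scale):
    """Map INT4 codes back to floats."""
    return [v * scale for v in q]


def top_k_experts(token, router_weights, k=2):
    """Set of the k experts with the highest router logits."""
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return set(order[:k])


def routing_agreement(router_weights, tokens, k=2):
    """Fraction of tokens whose top-k expert set is unchanged by INT4."""
    quantized = [dequantize(*quantize_int4(row)) for row in router_weights]
    same = sum(
        top_k_experts(t, router_weights, k) == top_k_experts(t, quantized, k)
        for t in tokens
    )
    return same / len(tokens)


if __name__ == "__main__":
    rng = random.Random(0)  # stand-in data, not real model weights
    router = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(8)]
    tokens = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(1000)]
    print(f"top-2 agreement after INT4: {routing_agreement(router, tokens):.1%}")
```

Any agreement rate below 100% means some tokens are sent to different experts after quantization, which is the effect such an experiment sets out to measure.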
My take
nanowhale won't replace any production model. But its position in the open source community is similar to JAX's role among deep learning frameworks: it is less a tool to build with than a way to understand the mechanics underneath.
If you're doing architecture research, or want students to understand what the MoE + MLA + MTP combo actually does, nanowhale is currently the lowest-cost path.
One question worth watching: will Hugging Face release micro clones of other frontier architectures? If Qwen3.6's MoE variant or Kimi K2.6's architecture can also be scaled down to the 100M level, the open source community's experimentation capability would jump significantly.