26M parameters. That's roughly one seven-thousandth of GPT-3's 175B.
Cactus Compute has open-sourced Needle, which they describe as distilling Gemini 3.1 into a function calling model that runs on phones, watches, and even smart glasses.
This isn't a "theoretically runnable" small model. Their production benchmarks, run on their own Cactus inference framework, show a prefill speed of 6,000 tokens/sec and a decode speed of 1,200 tokens/sec.
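To put those throughput numbers in perspective, here is a minimal sketch that turns them into a per-request latency estimate. The request size (300 prompt tokens, 40 output tokens) is a made-up example, not a figure from Cactus, and the estimate ignores model load time and any runtime overhead.

```python
# Throughput figures quoted in the benchmarks (tokens per second).
PREFILL_TOKENS_PER_SEC = 6000
DECODE_TOKENS_PER_SEC = 1200


def estimate_latency_ms(prompt_tokens: int, output_tokens: int) -> float:
    """Rough latency: time to prefill the prompt plus time to decode the output."""
    prefill_s = prompt_tokens / PREFILL_TOKENS_PER_SEC
    decode_s = output_tokens / DECODE_TOKENS_PER_SEC
    return 1000.0 * (prefill_s + decode_s)


# Hypothetical request: 300 tokens of tool schemas plus the user query,
# and a 40-token generated function call.
latency = estimate_latency_ms(prompt_tokens=300, output_tokens=40)
print(f"{latency:.1f} ms")  # 83.3 ms
```

Under these assumptions a full tool call would finish in well under a tenth of a second, most of it spent on prefill.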
Architecture: Simple Attention Network
Needle's architecture, called Simple Attention Network, follows a "minimalist" design:
- Encoder: 12 layers, but each layer has no FFN—only self-attention + RoPE + gated residual.
- Decoder: 8 layers doing cross-attention on encoder output.
- Dimension 512, 8 attention heads, 4 KV heads, vocab size 8192.
Removing the FFN is a bold choice. In a standard Transformer block, where the FFN's hidden width is 4x the model dimension, the FFN holds roughly two-thirds of the block's weights. Cutting it drastically reduces model size, at the cost of expressive power. But Needle's goal isn't to do everything; it does one task: tool calling.
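The parameter arithmetic can be checked with a few lines of Python. This is a back-of-the-envelope sketch, not Cactus's code: the FFN shape (two matrices, 4x expansion) is the textbook default, and the whole-model estimate assumes each decoder layer has one self-attention and one cross-attention block with no FFN, tied input/output embeddings, and no biases, norms, or gate weights. None of those details are confirmed by the source.

```python
# Figures from the architecture description.
D_MODEL = 512
N_HEADS = 8
N_KV_HEADS = 4
VOCAB = 8192
ENCODER_LAYERS = 12
DECODER_LAYERS = 8


def attention_params(d_model: int, n_heads: int, n_kv_heads: int) -> int:
    """Weights in one attention block with shared KV heads (no biases)."""
    head_dim = d_model // n_heads
    q_proj = d_model * n_heads * head_dim
    kv_proj = 2 * d_model * n_kv_heads * head_dim  # K and V projections
    out_proj = n_heads * head_dim * d_model
    return q_proj + kv_proj + out_proj


def ffn_params(d_model: int, expansion: int = 4) -> int:
    """Assumption: classic two-matrix FFN with hidden width expansion * d_model."""
    return 2 * d_model * (expansion * d_model)


attn = attention_params(D_MODEL, N_HEADS, N_KV_HEADS)
ffn = ffn_params(D_MODEL)
print(f"attention per block: {attn:,}")               # 786,432
print(f"hypothetical FFN:    {ffn:,}")                # 2,097,152
print(f"FFN share of block:  {ffn / (attn + ffn):.1%}")  # 72.7%

# Rough whole-model estimate under the assumptions listed above.
embedding = VOCAB * D_MODEL
encoder = ENCODER_LAYERS * attn
decoder = DECODER_LAYERS * 2 * attn  # self-attention + cross-attention
total = embedding + encoder + decoder
print(f"estimated total:     {total / 1e6:.1f}M")     # 26.2M
```

The estimate lands close to the 26M headline figure, which is consistent with an attention-only design but does not confirm the exact layer layout.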
Training
Pre-training ran on 16 TPU v6e over 200B tokens and took 27 hours. Post-training used 2B tokens of a single-shot function-calling dataset and took 45 minutes.
Weights and training data are open-sourced on Hugging Face under the MIT license.
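Those figures imply a very high training throughput for such a small model. The sketch below derives it, assuming the quoted durations are wall-clock time and that "16 TPU v6e" means 16 devices working in parallel; the source does not say what hardware was used for post-training, so no per-device figure is computed for that stage.

```python
PARAMS = 26e6
NUM_TPUS = 16

PRETRAIN_TOKENS = 200e9
PRETRAIN_SECONDS = 27 * 3600

POSTTRAIN_TOKENS = 2e9
POSTTRAIN_SECONDS = 45 * 60


def tokens_per_second(tokens: float, seconds: float) -> float:
    """Average training throughput over the whole run."""
    return tokens / seconds


pretrain_tps = tokens_per_second(PRETRAIN_TOKENS, PRETRAIN_SECONDS)
posttrain_tps = tokens_per_second(POSTTRAIN_TOKENS, POSTTRAIN_SECONDS)

print(f"pre-training:  {pretrain_tps:,.0f} tokens/s overall")        # ~2.06M
print(f"               {pretrain_tps / NUM_TPUS:,.0f} tokens/s per device")  # ~129K
print(f"post-training: {posttrain_tps:,.0f} tokens/s overall")       # ~741K
print(f"tokens seen per parameter: {PRETRAIN_TOKENS / PARAMS:,.0f}")  # ~7,692
```

The last line is the striking one: the model sees several thousand training tokens for every parameter it has.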
Benchmarks
Needle outperforms FunctionGemma-270m, Qwen-0.6B, Granite-350m, and LFM2.5-350m on single-shot function calling. The Needle team acknowledges, however, that these competing models have broader capabilities and perform better in conversational settings.
Why This Matters
Tool calling is a core Agent capability. Today it typically relies on frontier models (GPT-5.5, Claude Opus 4.7) running in the cloud, with non-trivial latency and cost.
Compressing tool calling to a 26M parameter model that runs locally could change the entire Agent architecture:
- Privacy: Tool calling data stays on-device.
- Latency: Local inference, no network round-trips.
- Cost: No per-call API fees; inference runs on hardware the user already owns.
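What that on-device loop might look like, in a minimal sketch: a local model turns the user's text into a structured call, and the app dispatches it without any network request. Everything here is hypothetical. `fake_local_model` is a stand-in for a real inference runtime, the two tools are invented, and the `{"name": ..., "arguments": ...}` JSON shape is a common convention rather than Needle's documented output format.

```python
import json
from typing import Callable, Dict


# Hypothetical on-device tools.
def set_timer(minutes: int) -> str:
    return f"timer set for {minutes} min"


def toggle_flashlight(on: bool) -> str:
    return "flashlight on" if on else "flashlight off"


TOOLS: Dict[str, Callable[..., str]] = {
    "set_timer": set_timer,
    "toggle_flashlight": toggle_flashlight,
}


def fake_local_model(user_text: str) -> str:
    """Stand-in for an on-device model; returns a JSON-encoded function call.

    A real integration would run local inference here instead of matching keywords.
    """
    if "timer" in user_text:
        return json.dumps({"name": "set_timer", "arguments": {"minutes": 5}})
    return json.dumps({"name": "toggle_flashlight", "arguments": {"on": True}})


def dispatch(raw_call: str) -> str:
    """Parse the model output, validate it, and run the matching local tool."""
    call = json.loads(raw_call)
    name = call.get("name")
    args = call.get("arguments", {})
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name!r}")
    if not isinstance(args, dict):
        raise ValueError("arguments must be a JSON object")
    return TOOLS[name](**args)


print(dispatch(fake_local_model("set a timer for five minutes")))  # timer set for 5 min
```

The user's text, the structured call, and the tool result all stay inside the app process, which is where the privacy and latency benefits come from.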
The weights of a 26M-parameter model occupy roughly 52 MB at 16-bit precision, and about 13-26 MB with 4-bit or 8-bit quantization. That is small enough to bundle with the main app.
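The footprint of the weights follows directly from the parameter count and the number of bits stored per weight. A quick sketch, counting raw weights only and ignoring quantization scales, the tokenizer, and file-format metadata:

```python
PARAMS = 26_000_000


def weight_size_mb(num_params: int, bits_per_weight: int) -> float:
    """Raw weight storage in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bits_per_weight / 8 / 1e6


for label, bits in [("fp32", 32), ("fp16/bf16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{label:>9}: {weight_size_mb(PARAMS, bits):6.1f} MB")
# fp32 -> 104.0 MB, fp16/bf16 -> 52.0 MB, int8 -> 26.0 MB, int4 -> 13.0 MB
```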
Primary sources: