Needle: Distilling Gemini 3.1 into a 26M Parameter Tool Calling Model

26M parameters. That's roughly 1/6,700 of GPT-3's 175B.

Cactus Compute has open-sourced Needle, which they describe as distilling Gemini 3.1 into a function calling model that runs on phones, watches, and even smart glasses.

This isn't a "theoretically runnable" small model. Their production benchmarks, run on their own Cactus inference framework, show a prefill speed of 6,000 tokens/sec and a decode speed of 1,200 tokens/sec.
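To put those throughput figures in terms of per-request latency, here is a rough back-of-the-envelope sketch. The request size (600 prompt tokens, 60 output tokens) is a hypothetical example, not a number from the Needle release, and the estimate ignores tokenization, model loading, and any other overhead.

```python
# Reported throughput figures from the article.
PREFILL_TOKS_PER_SEC = 6000
DECODE_TOKS_PER_SEC = 1200


def estimate_latency_ms(prompt_tokens: int, output_tokens: int) -> float:
    """Rough latency in milliseconds: prefill time plus decode time."""
    prefill_s = prompt_tokens / PREFILL_TOKS_PER_SEC
    decode_s = output_tokens / DECODE_TOKS_PER_SEC
    return 1000.0 * (prefill_s + decode_s)


# Hypothetical request: 600 prompt tokens (query plus tool schemas), 60 output tokens.
print(f"{estimate_latency_ms(600, 60):.0f} ms")
```

Under these assumptions, prefill takes about 100 ms and decode about 50 ms, so the whole call would finish in roughly 150 ms.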

Architecture: Simple Attention Network

Needle's architecture, called Simple Attention Network, follows a "minimalist" design:

Encoder: 12 layers, but each layer has no FFN—only self-attention + RoPE + gated residual.
Decoder: 8 layers doing cross-attention on encoder output.
Dimension 512, 8 attention heads, 4 KV heads, vocab size 8192.
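As a sanity check on the headline size, the listed dimensions can be turned into a rough parameter estimate. This is a sketch under several assumptions that the article does not confirm: no biases, tied input/output embeddings, a decoder layer made of one self-attention block and one cross-attention block with no FFN, and norm and gate parameters small enough to ignore.

```python
# Dimensions taken from the article.
D_MODEL = 512
N_HEADS = 8
N_KV_HEADS = 4
VOCAB = 8192
ENC_LAYERS = 12
DEC_LAYERS = 8

HEAD_DIM = D_MODEL // N_HEADS    # 64
KV_DIM = HEAD_DIM * N_KV_HEADS   # 256, since K/V use fewer heads than Q


def attention_params() -> int:
    """Weights in one attention block with grouped K/V heads (assumed bias-free)."""
    q_and_o = 2 * D_MODEL * D_MODEL   # query and output projections
    k_and_v = 2 * D_MODEL * KV_DIM    # key and value projections
    return q_and_o + k_and_v


def estimate_total_params() -> int:
    """Rough total; the layer layout is an assumption, not the published spec."""
    encoder = ENC_LAYERS * attention_params()
    # Assumption: each decoder layer = self-attention + cross-attention, no FFN.
    decoder = DEC_LAYERS * 2 * attention_params()
    embedding = VOCAB * D_MODEL       # assumption: tied input/output embeddings
    return encoder + decoder + embedding


print(f"{estimate_total_params() / 1e6:.1f}M parameters")
```

The estimate comes out at about 26.2M, which is consistent with the stated model size, though that agreement alone does not confirm the assumed layer layout.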

Removing the FFN is a bold choice. In a standard Transformer layer, the FFN typically accounts for about two-thirds or more of the parameters. Cutting it drastically reduces model size, at the cost of expressive power. But Needle's goal isn't "do everything": it does one task, tool calling.
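The two-thirds figure follows from simple counting. The sketch below assumes a conventional layer with a 4x FFN expansion and no biases, which is a common default rather than anything specific to Needle; the function name and its parameters are illustrative.

```python
def ffn_share(d_model: int, kv_dim: int, ffn_mult: int = 4) -> float:
    """Fraction of a layer's weights that sit in the FFN (biases and norms ignored)."""
    # Attention: Q and O projections are d x d, K and V are d x kv_dim.
    attn = 2 * d_model * d_model + 2 * d_model * kv_dim
    # FFN: one up projection and one down projection of width ffn_mult * d.
    ffn = 2 * d_model * (ffn_mult * d_model)
    return ffn / (attn + ffn)


# Full multi-head attention: K/V width equals the model width.
print(f"{ffn_share(512, 512):.3f}")
# Grouped K/V heads at half width, as in a 8-head / 4-KV-head layout.
print(f"{ffn_share(512, 256):.3f}")
```

With full multi-head attention the FFN holds 8 of every 12 weights (two-thirds); with K/V projections at half width the attention block shrinks and the FFN share rises to 8/11, or about 73%.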

Training

Pre-training ran on 16 TPU v6e for 200B tokens and took 27 hours. Post-training used a 2B-token single-shot function calling dataset and took 45 minutes.
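These figures imply a training throughput that is easy to work out. The sketch below is plain arithmetic on the reported numbers; the per-chip figure assumes that "16 TPU v6e" means 16 chips sharing the load evenly, which the article does not spell out.

```python
def tokens_per_second(total_tokens: float, hours: float) -> float:
    """Aggregate training throughput implied by a token count and wall-clock time."""
    return total_tokens / (hours * 3600)


pretrain = tokens_per_second(200e9, 27)       # 200B tokens in 27 hours
posttrain = tokens_per_second(2e9, 45 / 60)   # 2B tokens in 45 minutes

print(f"pre-training:  {pretrain / 1e6:.2f}M tokens/sec")
# Assumption: 16 chips, evenly loaded.
print(f"per chip:      {pretrain / 16 / 1e3:.0f}k tokens/sec")
print(f"post-training: {posttrain / 1e6:.2f}M tokens/sec")
```

That works out to roughly 2 million tokens per second in pre-training, or about 129k tokens per second per chip under the 16-chip assumption.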

Weights and training data are open-sourced on Hugging Face under MIT license.

Benchmarks

Needle outperforms FunctionGemma-270m, Qwen-0.6B, Granite-350m, and LFM2.5-350m on single-shot function calling. But the Needle team acknowledges that these competing models have broader capabilities and perform better in conversational settings.
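For readers unfamiliar with the term, single-shot function calling means one request in, one structured call out, with no multi-turn dialogue. The sketch below illustrates the application side of that loop. The JSON shape, the `set_timer` tool, and the `dispatch` helper are all hypothetical and are not Needle's actual output format or API.

```python
import json


def set_timer(minutes: int) -> str:
    """Hypothetical on-device tool."""
    return f"timer set for {minutes} min"


# Registry of tools the app exposes (hypothetical).
TOOLS = {"set_timer": set_timer}


def dispatch(model_output: str) -> str:
    """Parse one JSON function call and run the matching tool."""
    call = json.loads(model_output)
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**call.get("arguments", {}))


# Stand-in for what a tool calling model might emit for
# the request "set a timer for 5 minutes" (assumed format).
example_output = '{"name": "set_timer", "arguments": {"minutes": 5}}'
print(dispatch(example_output))
```

A benchmark of this kind would then score whether the emitted function name and arguments match the expected call.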

Why This Matters

Tool calling is a core Agent capability. Currently, tool calling relies on frontier models (GPT-5.5, Claude Opus 4.7) running in the cloud, with non-trivial latency and cost.

Compressing tool calling to a 26M parameter model that runs locally could change the entire Agent architecture:

Privacy: Tool calling data stays on-device.
Latency: Local inference, no network round-trips.
Cost: Free inference on edge devices.

26M parameters occupy roughly 50MB at 16-bit precision and about 100MB at 32-bit; 8-bit or 4-bit quantization shrinks that further. At any of these sizes, the model can be bundled with the main app.
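The footprint at each precision is straightforward to compute. This sketch counts weights only, using 1 MB = 1,000,000 bytes, and ignores file-format overhead and runtime memory such as the KV cache; the precision labels are generic and do not describe which formats Cactus actually ships.

```python
PARAMS = 26_000_000

# Bits per weight for a few common storage formats.
BITS = {"fp32": 32, "fp16": 16, "int8": 8, "int4": 4}


def weight_size_mb(params: int, bits: int) -> float:
    """Size of the raw weights in megabytes (1 MB = 1,000,000 bytes)."""
    return params * bits / 8 / 1_000_000


for name, bits in BITS.items():
    print(f"{name}: {weight_size_mb(PARAMS, bits):.0f} MB")
```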
