Running SimpleQA to 95% Locally: local-deep-research Lets Qwen3.6-27B Beat Cloud on a 3090

When OpenAI released SimpleQA, the goal was to test whether models "can answer simple factual questions." The results? GPT-4o scored 61%, o1 around 70%+. Most models performed mediocrely.

Then an open-source project called local-deep-research said: I'm using Qwen3.6-27B, one RTX 3090, and hitting ~95%.

This isn't a breakthrough in model capability — it's systems engineering.

How It Works

The core idea isn't training a smarter model, but making a decent model extremely reliable through toolchains and search strategies.

Specifically:

10+ search engine integration — arXiv, PubMed, general search, private documents, extremely broad coverage
Multi-round search & verification — not searching once and answering, but repeatedly searching and cross-validating
Local & fully encrypted — data stays on your machine, a hard requirement for privacy-sensitive scenarios
Supports all major LLM backends — llama.cpp, Ollama, Google API, various cloud models

7,572 stars, 2,046 gained this week. 6,448 commits — this project is incredibly active.

Understanding the 95% Number

SimpleQA measures factual QA accuracy. local-deep-research hits 95% not because the model itself is that smart (Qwen3.6-27B's raw SimpleQA score is much lower), but because:

Search engines provide external knowledge
Multi-round search strategies cover information blind spots
Cross-validation reduces hallucination

In other words, it turned "model knowledge储备" into "model retrieval and verification capability." This is actually closer to how humans do research — you don't answer everything from memory; you look things up, compare sources, and draw conclusions.

But There Are Limits

Slow. Multi-round search + reasoning means one query can take tens of seconds or even minutes. Not all scenarios can accept this latency.

Token consumption isn't low. Even though the model runs locally and doesn't cost API money, inference speed for a 27B model on a 3090 is limited. Batch queries will queue up.

95% is under specific conditions. README says "~~95% on SimpleQA (e.g. Qwen3.6-27B on a 3090)" — that "~~" matters. Different question types and search engine configurations produce different results.

My Take

local-deep-research represents an important trend: local small model + search augmentation > cloud large model, at least in the factual QA track.

This doesn't mean GPT-4o is done. It means for scenarios needing accurate factual answers, a locally deployed medium model with good search strategy can outperform solutions that rely solely on the model's internal knowledge.

Main sources:

How It Works

Understanding the 95% Number

But There Are Limits

My Take

Related

Chrome DevTools Officially Releases MCP Server: AI Coding Agents Can Finally "See" the Browser

Google I/O 2026: The "Agentification" of Search Isn't an Upgrade, It's a Rewrite

Google's SynthID Watermarking Technology Adopted by Giants Like OpenAI and Nvidia: AI Content Provenance Enters the Standardization Era