Running at 95% SimpleQA Accuracy on a Single RTX 3090: local-deep-research Brings Academic Research Back to the Local Machine

The Bottom Line

SimpleQA is a factuality-focused question-answering benchmark introduced by OpenAI—designed specifically to evaluate whether a model knows the correct answer, not whether it can fabricate a plausible-sounding one. A 95% score means that, on consumer-grade hardware, a locally run model now approaches—or even surpasses—the factual accuracy of many cloud-based APIs.
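To make the headline number concrete, here is a minimal sketch of how per-question grades roll up into a score. It assumes the three-way grading scheme SimpleQA describes (correct, incorrect, not attempted); the function name, label strings, and sample grades are illustrative and are not taken from the project or from OpenAI's evaluation code.

```python
from collections import Counter

# Label strings are illustrative; only the three-way split is assumed.
CORRECT, INCORRECT, NOT_ATTEMPTED = "correct", "incorrect", "not_attempted"


def simpleqa_summary(grades):
    """Summarize a list of per-question grades into headline metrics."""
    counts = Counter(grades)
    total = len(grades)
    attempted = counts[CORRECT] + counts[INCORRECT]
    return {
        # Share of all questions answered correctly (the headline figure).
        "accuracy": counts[CORRECT] / total if total else 0.0,
        # Accuracy restricted to questions the system actually answered.
        "correct_given_attempted": counts[CORRECT] / attempted if attempted else 0.0,
        # Share of questions where the system declined to answer.
        "not_attempted_rate": counts[NOT_ATTEMPTED] / total if total else 0.0,
    }


# Made-up grades for 20 questions: 19 correct, 1 incorrect, none skipped.
example = [CORRECT] * 19 + [INCORRECT]
print(simpleqa_summary(example))
```

A system that declines uncertain questions can show a high correct-given-attempted rate while its overall accuracy stays lower, so it is worth checking which of the two a reported score refers to.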

What Makes This Project Stand Out

1. 95% SimpleQA Is Not Just Marketing

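The project uses Qwen3.6-27B deployed on an RTX 3090 (24 GB VRAM). Reaching this score with a 27B-parameter model on a single consumer GPU shows how far quantization and inference optimization have matured. For context, OpenAI's SimpleQA paper reported GPT-4o answering only about 38% of questions correctly without search tools, while search-augmented systems score far higher. Because local-deep-research pairs the model with search, the fair comparison is with other tool-assisted setups, and benchmark conditions may still differ. Even so, a result at this level on local hardware is significant.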

2. Integration with 10+ Search Engines and Data Sources

This goes far beyond wrapping Google Search. It natively integrates arXiv, PubMed, and private document repositories, so a single query can retrieve from academic paper archives, biomedical databases, and your personal notes without manual tool-switching during research.
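The idea of unified retrieval can be sketched as a fan-out over interchangeable backends followed by a merge. Everything below is hypothetical: the `Hit` type, the stub backends, and `unified_search` are stand-ins for illustration and do not reflect the project's actual interfaces. Real backends would call arXiv, PubMed, or a local document index instead of returning canned results.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Hit:
    source: str  # which backend produced the result
    title: str
    url: str


# Hypothetical stub backends returning canned results.
def fake_arxiv(query: str) -> List[Hit]:
    return [Hit("arxiv", f"Preprint on {query}", "https://example.org/a1")]


def fake_pubmed(query: str) -> List[Hit]:
    return [
        Hit("pubmed", f"Trial about {query}", "https://example.org/p1"),
        # Same URL as the arXiv hit, to show de-duplication.
        Hit("pubmed", f"Preprint on {query}", "https://example.org/a1"),
    ]


def unified_search(query: str, backends: List[Callable[[str], List[Hit]]]) -> List[Hit]:
    """Query every backend in order and merge results, keeping the first hit per URL."""
    seen = set()
    merged = []
    for search in backends:
        for hit in search(query):
            if hit.url not in seen:
                seen.add(hit.url)
                merged.append(hit)
    return merged


results = unified_search("sepsis", [fake_arxiv, fake_pubmed])
for hit in results:
    print(hit.source, "|", hit.title, "|", hit.url)
```

Keeping each backend behind the same call signature is what lets a private document store sit next to public archives without special handling in the research loop.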

3. Fully Local & Encrypted

In privacy-sensitive domains such as medical research, legal analysis, and enterprise knowledge management, keeping data strictly local is non-negotiable. The project is local-first at the architectural level: all inference and retrieval occur entirely on-device.

Engineering Maturity

6,432 commits, 155 tags, and 439 branches—this is no weekend hackathon project. Its weekly gain of 2,449 stars reflects rapidly growing community momentum.

Recent traction stems from several key features:

Source-tagged citations with global counter (#4012): Critical for academic integrity and traceability
Pre-commit hooks validating settings key namespaces: Strong engineering discipline
Robust CI/CD permission management
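The first item in that list, source-tagged citations with a global counter, can be pictured with a small sketch. This is not the project's implementation: the `CitationRegistry` class and the `[number:source]` marker format are invented here purely to show why a single running counter across all sources helps traceability.

```python
class CitationRegistry:
    """Assigns one running citation number per unique URL, across all sources."""

    def __init__(self):
        self._by_url = {}  # url -> (number, source_tag)

    def cite(self, url, source_tag):
        # Reuse the existing number if this URL has been cited before.
        if url not in self._by_url:
            self._by_url[url] = (len(self._by_url) + 1, source_tag)
        number, tag = self._by_url[url]
        return f"[{number}:{tag}]"

    def bibliography(self):
        # List entries in the order their numbers were assigned.
        ordered = sorted(self._by_url.items(), key=lambda item: item[1][0])
        return [f"[{number}] ({tag}) {url}" for url, (number, tag) in ordered]


registry = CitationRegistry()
first = registry.cite("https://example.org/a1", "arxiv")
second = registry.cite("https://example.org/p1", "pubmed")
repeat = registry.cite("https://example.org/a1", "arxiv")
print(first, second, repeat)
print("\n".join(registry.bibliography()))
```

With one counter shared by every backend, a number in the generated report always points to exactly one entry, no matter which search engine found it.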

But Don’t Get Too Excited—Yet

A few important caveats:

95% SimpleQA ≠ 95% General Research Capability

SimpleQA measures factual recall—not deep reasoning, literature synthesis, hypothesis generation, or cross-domain analytical integration. These higher-order research tasks fall outside its scope.

Real-World Experience Running a 27B Model on an RTX 3090

Running a 27B model in 24 GB of VRAM requires aggressive quantization, most likely 4-bit. The resulting latency and precision trade-offs need empirical validation. End-to-end pipeline latency, spanning document retrieval, reasoning, and response generation, may be measured in minutes.
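A back-of-the-envelope calculation shows why 4-bit is the likely choice. The sketch below counts weight storage only and assumes a flat number of bits per parameter. Real quantization formats add some overhead for scales and metadata, and the KV cache and activations need memory on top, so treat the output as a lower bound rather than a measurement.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights only, in GB (10^9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


VRAM_GB = 24  # RTX 3090

for bits in (16, 8, 4):
    needed = weight_memory_gb(27, bits)
    verdict = "fits" if needed < VRAM_GB else "does not fit"
    print(f"{bits}-bit: {needed:.1f} GB of weights, {verdict} in {VRAM_GB} GB")
```

At 16-bit and 8-bit the weights alone (54 GB and 27 GB) exceed the card's capacity. At 4-bit they take about 13.5 GB, leaving roughly 10 GB for the KV cache, activations, and runtime overhead.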

187 Open PRs

High community activity is encouraging—but 187 pending pull requests also suggest maintainers may be stretched thin. Before adopting, verify whether your required functionality is already available in the stable branch.

How to Choose Between Local and Cloud Solutions

Scenario	Recommended Approach
Privacy-sensitive use cases; strict data-locality requirements	local-deep-research
Need for lowest possible latency	Cloud APIs (e.g., Claude, GPT)
Large-scale batch research workflows	local-deep-research (no API costs)
Requirement for cutting-edge model capabilities	Cloud APIs (local models inevitably lag in release timing)

In One Sentence

This project proves that, for well-defined research tasks, medium-scale models deployed locally can now deliver factual accuracy rivaling that of large cloud-hosted models—not as a wholesale replacement for cloud APIs, but as a compelling, privacy-preserving, cost-effective alternative when API expenses and data sovereignty become hard constraints.

Primary Sources:

LearningCircuit/local-deep-research GitHub
SimpleQA benchmark: OpenAI’s factuality-focused question-answering evaluation suite
