Mathematics research might be the last fortress AI hasn't breached at scale.
Programming has SWE-bench, where top scores sit around 82%, and writing has its own benchmarks, but math is different: frontier problems aren't "right or wrong" multiple choice. They require proofs, insight, and the kind of intuition that human mathematicians spend months developing.
On May 8, Google DeepMind released the AI co-mathematician tech report. It describes not a "model that solves problems," but a collaborative workbench designed for mathematicians.
It's not an answering machine
The system's positioning is clear: not replacing mathematicians, but working alongside them.
It's composed of multiple agents with distinct roles — one generates proof approaches, another verifies derivation steps, another searches relevant literature. Mathematicians can intervene, guide, and correct at any stage.
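To make that division of labor concrete, here is a minimal Python sketch. It is an illustration under assumptions, not the architecture from the report: `Workspace`, `run_workbench`, the stub agents, and the order in which the roles run are all hypothetical names and choices invented here. It shows only the general shape: each role writes to a shared workspace, and an optional human hook can override any stage's output.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Workspace:
    """Shared state every agent can read (hypothetical, not from the report)."""
    problem: str
    notes: list[str] = field(default_factory=list)


# An agent reads the workspace and returns its contribution as text.
Agent = Callable[[Workspace], str]
# A human hook sees (role, output) and returns a correction, or None to accept.
HumanHook = Callable[[str, str], Optional[str]]


def run_workbench(problem: str,
                  agents: list[tuple[str, Agent]],
                  human: Optional[HumanHook] = None) -> Workspace:
    """Run each role in turn, letting a human override any stage's output."""
    ws = Workspace(problem)
    for role, agent in agents:
        output = agent(ws)
        if human is not None:
            correction = human(role, output)
            if correction is not None:
                output = correction  # the mathematician's guidance wins
        ws.notes.append(f"{role}: {output}")
    return ws


# Stub agents standing in for model calls (assumptions for illustration).
def search_literature(ws: Workspace) -> str:
    return "two related lemmas found"


def propose_approach(ws: Workspace) -> str:
    return "try induction on n"


def verify_steps(ws: Workspace) -> str:
    return f"checked {len(ws.notes)} earlier notes"


AGENTS = [
    ("literature", search_literature),
    ("proposer", propose_approach),
    ("verifier", verify_steps),
]


def steer(role: str, output: str) -> Optional[str]:
    """A mathematician who only steps in on the proof approach."""
    if role == "proposer":
        return "try strong induction on n"
    return None


result = run_workbench("toy problem", AGENTS, human=steer)
```

Returning `None` from the hook means "accept as is," so the mathematician can stay silent on most stages and step in only where it matters.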
Sounds like standard Agent framework stuff? The difference is the difficulty level of the tasks.
FrontierMath Tier 4: 48%
FrontierMath is a research-level mathematics benchmark. Tier 4 is its highest difficulty, containing 50 problems that even university professors thought "AI wouldn't touch for decades."
AI co-mathematician scored 48% on these 50 problems.
What does that number mean? On close to half of these top-tier research problems, the system produced partial or complete solution approaches. These are not multiple-choice questions but open-ended problems requiring constructive proofs.
The more interesting part is how it works: the system generates a proof, its own reviewer agent checks that proof and flags errors, and the generator then revises the proof in response. This "self-correction" loop is much more reliable than a single "generate once" pass.
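The generate, review, correct cycle can be sketched in a few lines of Python. This is a minimal sketch under assumptions, not DeepMind's implementation: `self_correct`, `max_rounds`, and the toy generator and reviewer are hypothetical stand-ins for what would be model calls in a real system.

```python
from typing import Callable

# (problem, reviewer feedback so far) -> proof draft
Generator = Callable[[str, list[str]], str]
# (problem, draft) -> list of flagged errors; an empty list means "accepted"
Reviewer = Callable[[str, str], list[str]]


def self_correct(problem: str,
                 generate: Generator,
                 review: Reviewer,
                 max_rounds: int = 3) -> tuple[str, bool]:
    """Draft, review, and redraft until the reviewer has no objections.

    Returns (last draft, accepted). The round cap is an assumption that
    keeps the loop from running forever on a proof that never converges.
    """
    feedback: list[str] = []
    draft = ""
    for _ in range(max_rounds):
        draft = generate(problem, feedback)
        issues = review(problem, draft)
        if not issues:
            return draft, True
        feedback.extend(issues)  # the next draft sees every flagged error
    return draft, False


# Toy stand-ins: the generator forgets the base case until told about it.
def toy_generate(problem: str, feedback: list[str]) -> str:
    steps = ["Inductive step: assume P(n), derive P(n+1)."]
    if any("base case" in item for item in feedback):
        steps.insert(0, "Base case: P(0) holds by direct computation.")
    return "\n".join(steps)


def toy_review(problem: str, draft: str) -> list[str]:
    return [] if "Base case" in draft else ["missing base case"]


proof, accepted = self_correct("P(n) for all n", toy_generate, toy_review)
```

With the toy pair above, the first draft is rejected for the missing base case and the second is accepted. Setting `max_rounds=1` reproduces the "generate once" behavior, where the flawed first draft is all you get.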
Real feedback from mathematicians
DeepMind had mathematicians actually test the system. One tester said: "It won't help you find the key insight — but once you have the insight, it helps you write out the full proof and fill in the details."
That's arguably the most realistic positioning for AI as a research tool right now: not replacing your inspiration, but amplifying your execution.
You do the "thinking," it does the "writing."
The gap with Claude and GPT
Current general models (including Claude 4 and GPT-5.5) still hit a clear ceiling on pure mathematical reasoning. They handle medium-difficulty proofs, but on research-level problems that require multi-step construction and cross-domain knowledge integration, they tend to get stuck on one detail, and then the whole proof collapses.
AI co-mathematician's design approach: split "one model does everything" into "multiple agents each do their job + human experts intervene at critical moments." It's not a smarter model — it's a smarter process.
Is it open source?
The tech report is public, but the system itself remains a research prototype — no open source release, no API.
For Agent framework builders, this case is worth dissecting: when does multi-agent collaboration outperform single agents? What are the design patterns for self-correction loops? How do you determine when human intervention is needed?
Co-mathematician gives a concrete reference answer to these questions.