Soohak: 43 mathematicians hand-craft problems to test LLMs on research-level math

AIME, MATH, GSM8K... LLM math benchmarks are crowded. But one criticism persists: models have already seen many of these problems during pretraining.

Soohak takes a different approach.

arXiv:2605.09063 introduces the Soohak benchmark, hand-crafted by 43 mathematicians (yes, 43), covering undergraduate to graduate-level mathematics. This isn't a question bank assembled from existing sources; it's a question creation effort.

Why hand-craft problems

Existing math benchmarks have several issues:

Data leakage. AIME and AMC problems are everywhere online, so models may have already "memorized" them during pretraining. A benchmark built on them tests memory, not reasoning.

Low difficulty ceiling. GSM8K is grade-school level. MATH is high school competition level. Neither is enough to evaluate whether LLMs can do real math research.

Narrow coverage. Most benchmarks focus on algebra and combinatorics, with insufficient coverage of number theory, analysis, topology, etc.

Soohak's solution is brute force: have mathematicians write new problems. Because these problems aren't on the web, the model hasn't seen them, so the benchmark tests reasoning rather than recall.
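To make the leakage concern concrete, here is a minimal sketch of a common contamination heuristic: measuring how many of a problem's word n-grams also appear in a reference corpus. This is not the Soohak paper's method. The function names, the tokenizer, and the default n=8 are assumptions chosen for illustration.

```python
import re


def ngrams(text, n):
    """Return the set of lowercase word n-grams in text (hypothetical helper)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(problem, corpus_docs, n=8):
    """Fraction of the problem's n-grams that also occur in any corpus document.

    A value near 1.0 suggests the problem text appears in the corpus;
    a value near 0.0 suggests no verbatim overlap. n=8 is an assumed default.
    """
    problem_grams = ngrams(problem, n)
    if not problem_grams:
        # Problem is shorter than n tokens, so there is nothing to compare.
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(problem_grams & corpus_grams) / len(problem_grams)
```

A check like this only catches verbatim or near-verbatim copies. A paraphrased or translated problem would pass it, which is part of why writing fresh problems is a stronger guarantee than filtering old ones.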

The weight of this benchmark

Participating institutions include EleutherAI, CMU (Sean Welleck, Graham Neubig), KAIST, and others with deep expertise in AI evaluation and math reasoning.

Sean Welleck has authored several influential papers on LLM math capability. Graham Neubig's LLM research group at CMU is top-tier. This lineup shows Soohak is not a small project.

A real concern

Hand-crafted benchmarks have a natural limitation: high update cost. Each batch of problems requires significant mathematician time. And if the community adopts the benchmark widely, will the problems leak into training data?

The paper notes that the benchmark is under review and not yet publicly released. This may be an anti-leak measure.

My take

Soohak's direction is right. LLM math evaluation has reached the point where "existing benchmarks are getting maxed out." We need new, cleaner evaluation methods to distinguish "memorization ability" from "reasoning ability."

But I have one reservation: if Soohak's problems are eventually published, will model companies use them for post-training reinforcement learning? If so, Soohak's "cleanliness" can only last one evaluation round.

Primary sources:

arXiv:2605.09063 - Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs
Submitted by amphora on Hugging Face Daily Papers, 2026-05-12
