ChaoBro

Qwen3.6-27B Aims for Perfect Score on AIME25: New Watershed for Open-Source Math Reasoning

What Happened

Community evaluator @nanowell published a striking set of data on X:

Qwen3.6-27B achieved 100% accuracy on the AIME25 math competition benchmark.

AIME (American Invitational Mathematics Examination) is a US mathematics invitational competition. AIME25 is an AI math reasoning benchmark based on this exam, with problems far beyond standard high school math, involving combinatorics, number theory, geometry, and other advanced reasoning skills.

The evaluator also noted:

“Qwen3.6 27B is one of the few open models that can reach 100% accuracy on AIME25. The model seems to have been particularly fine-tuned for this type of task. It’s much better than Qwen3.5 on average.”

Data Comparison: Qwen3.6 vs Qwen3.5

| Dimension | Qwen3.5 Series | Qwen3.6-27B | Change |
| --- | --- | --- | --- |
| AIME25 | ~72% | 100% | +28 pp |
| Model size | 32B–72B multi-tier | 27B | Smaller but stronger |
| Math reasoning | General fine-tuning | Targeted reinforcement | Specialized tuning |
| Open-source availability | Partial weights | Full weights open | More open |

Key Signals

  1. 27B scale achieves perfect score: This means medium-scale open models can match or even surpass closed-source models with hundreds of billions of parameters in specific domains.
  2. Targeted fine-tuning is highly effective: Alibaba clearly added a specialized math reasoning enhancement stage in Qwen3.6’s training pipeline.
  3. Average performance also surpasses predecessor: Not just math — Qwen3.6 shows clear improvement over Qwen3.5 across overall benchmarks.

Technical Path Speculation

Qwen3.6-27B’s breakthrough in math reasoning likely comes from several technical directions:

1. GRPO Reinforcement Learning Tuning

Alibaba has previously published research applying GRPO (Group Relative Policy Optimization) to the Qwen line. GRPO is a reinforcement learning algorithm designed for reasoning tasks, and it is better suited than traditional RLHF to multi-step reasoning scenarios such as math.
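If GRPO was indeed part of the recipe, its core idea fits in a few lines: sample a group of responses per prompt and score each one against the group’s mean reward, with no separate value network. A minimal illustrative sketch (not Alibaba’s actual training code):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages for one prompt's sampled responses.

    Each response is scored against the mean reward of its own group,
    which removes the separate value model that PPO-style RLHF needs.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Example: 4 sampled solutions to one problem, reward 1.0 = correct answer
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

The correct solutions get positive advantages and the incorrect ones negative, so the policy gradient pushes probability mass toward the chains that solved the problem.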

2. Think Token Optimization

The Qwen team has done extensive work on optimizing think tokens. By finely controlling the ratio of “thinking” to “output” during reasoning, the model can maintain answer quality while reducing reasoning latency.
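Qwen’s actual mechanism here is not public; one common way to bound thinking length is to cap the tokens spent inside the think block and force it closed at a budget. A hypothetical sketch (`generate_step` and the token stream are stand-ins, not a real API; in a real decoder, forcing the close token also changes what the model generates next):

```python
def cap_thinking(generate_step, budget, end_think="</think>"):
    """Stream tokens, forcing the thinking phase closed once `budget`
    thinking tokens have been spent. `generate_step` is a hypothetical
    callable returning the next token string, or None when finished."""
    out, thinking, spent = [], True, 0
    while True:
        tok = generate_step()
        if tok is None:
            break
        if thinking:
            spent += 1
            if tok == end_think or spent >= budget:
                out.append(end_think)  # close the block; drop any over-budget token
                thinking = False
                continue
        out.append(tok)
    return "".join(out)

# Toy stream: a budget of 2 cuts the thinking phase short.
stream = iter(["t1", "t2", "t3", "the answer is 7"])
capped = cap_thinking(lambda: next(stream, None), budget=2)
```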

3. Synthetic Data Distillation

A larger model (such as Qwen3.6-Max) generates high-quality math reasoning chains, which are then distilled into the 27B model. This teacher-student distillation strategy is particularly effective for math reasoning tasks.
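A common recipe for this is rejection-sampled distillation: sample several teacher solutions per problem and keep only chains whose final answer matches the known ground truth. A sketch under that assumption (`teacher_solve` is a hypothetical stand-in for sampling from the teacher model):

```python
def build_distill_set(problems, teacher_solve, n_samples=4):
    """Build an SFT dataset from teacher reasoning chains, keeping only
    chains whose final answer matches the gold answer (a verifiable-reward
    filter, which is what makes math so distillation-friendly)."""
    dataset = []
    for question, gold in problems:
        for _ in range(n_samples):
            chain, answer = teacher_solve(question)
            if answer == gold:
                dataset.append({"prompt": question, "target": chain})
                break  # one verified chain per problem is enough here
    return dataset

# Toy teacher that "solves" arithmetic strings; real teachers are sampled LLMs.
def fake_teacher(q):
    return ("steps for " + q, str(eval(q)))

ds = build_distill_set([("1+1", "2"), ("2+2", "5")], fake_teacher)
```

The second problem is dropped because the teacher’s answer never matches the (deliberately wrong) gold label, illustrating how the filter discards unverified chains.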

Open Source Ecosystem Impact

Qwen3.6-27B’s AIME25 perfect score carries significance beyond a benchmark number:

For Developers

  • Local deployment feasibility: a quantized 27B model can run on a consumer-grade GPU (like an RTX 4090 with 24GB of VRAM), meaning enterprises can obtain top-tier math reasoning capability locally.
  • Cost-effectiveness: Compared to calling closed-source APIs, running a 27B model locally is cheaper for large-scale inference scenarios.
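A back-of-the-envelope check of that feasibility claim, counting weight memory alone (KV cache and activations add several more GB at long context lengths):

```python
def weight_memory_gb(params_billion, bits_per_param):
    """Approximate GPU memory for model weights only:
    params * bits-per-param / 8 bytes."""
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

# 27B weights at common precisions
fp16 = weight_memory_gb(27, 16)  # full precision: needs multi-GPU
q4   = weight_memory_gb(27, 4)   # 4-bit quantized: fits a 24 GB card
```

At FP16 the weights alone are ~54 GB, so the consumer-GPU story depends on quantization: at 4 bits they drop to ~13.5 GB, leaving headroom on a 24 GB RTX 4090.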

For the Industry

  • Gap between open and closed source narrowing: in math reasoning, a domain traditionally led by closed-source models, open models have caught up with, and in places surpassed, their closed counterparts.
  • Specialization trend: Future competition isn’t just about “all-around” models, but “domain-specialized” models.

For the Chinese Model Ecosystem

Qwen3.6’s continuous iteration solidifies Alibaba’s position in the first tier of Chinese large models. Combined with Qwen3.6-Max Preview’s performance on SWE-bench, Alibaba is building a comprehensive open-source model matrix from code to math.

Landscape Assessment

Qwen3.6-27B’s AIME25 perfect score releases three clear signals:

  1. Model size is no longer the determining factor for performance — 27B can beat larger models; the key is training strategy.
  2. Math reasoning is becoming the new touchstone for model capability — after code capability, math reasoning becomes the new standard for distinguishing model tiers.
  3. Open source models’ “targeted reinforcement” route is working — rather than pursuing all-around competence, achieving excellence in key domains is the winning strategy.

Action Recommendations

  1. Math-intensive applications should prioritize testing Qwen3.6-27B: In education, research, financial modeling, etc., this model offers excellent cost-effectiveness.
  2. Watch for other Qwen3.6 series size variants: If 27B already achieves a perfect score, the larger 35B and smaller 4B/7B versions deserve continued attention.
  3. Deploy with local inference frameworks: combined with LM Studio, Ollama, and other local inference tools, you can get top-tier math reasoning capability with no per-token API cost.
  4. Compare with Kimi K2.6 and DeepSeek V4: As domestic open-source models, the math reasoning capability comparison among these three will provide direct reference for model selection.

A new watershed for open-source math reasoning has arrived. Qwen3.6-27B proves: medium scale + precise tuning = top-tier performance.