ChaoBro

Xiaomi MiMo-V2.5-Pro Tops GDPval-AA Benchmark, China Open-Source Model Landscape Reshaped


Key Takeaway

The latest GDPval-AA benchmark results for real-world agentic workloads are out, and Xiaomi MiMo-V2.5-Pro takes first place with a score of 1578, ending DeepSeek’s streak in this evaluation. The gap among China’s top five open-source models has narrowed to within 94 points, shifting the competitive landscape from “one dominant player” to “many rising contenders.”

| Model | GDPval-AA Score | Rank | Release Date |
| --- | --- | --- | --- |
| Xiaomi MiMo-V2.5-Pro | 1578 | 1 | 2026.05 |
| DeepSeek V4 Pro | 1554 | 2 | 2026.04 |
| GLM 5.1 | 1535 | 3 | 2026.04 |
| MiniMax M2.7 | 1514 | 4 | 2026.04 |
| Kimi K2.6 | 1484 | 5 | 2026.04 |

What Happened

GDPval-AA is a benchmark focused on real-world agentic capabilities. Unlike traditional knowledge quizzes or multiple-choice tests, it evaluates a model’s planning, tool-calling, and multi-step reasoning abilities in practical tasks.

MiMo-V2.5-Pro’s rise to the top sends several key signals:

First, smartphone manufacturers are entering the foundation model battlefield. Xiaomi’s AI presence has historically been concentrated in end-user applications (phone AI assistants, IoT devices), with the MiMo series serving primarily as a supporting model for its own ecosystem. V2.5-Pro breaking into the top tier of open-source benchmarks signals that phone manufacturers are moving from the “AI application layer” into the “foundation model layer.”

Second, the five-way gap is only 94 points. The difference between the top score of 1578 and fifth place at 1484 is just 6%, meaning that on this evaluation dimension, China’s top open-source models have entered a “no absolute king” competitive phase. User choice is no longer determined by benchmark scores alone — API pricing, context window size, and inference speed all factor in.

Cross-Benchmark Comparison: Different Dimensions, Different Winners

GDPval-AA is just one piece of the evaluation puzzle. Across multiple independent benchmarks, the top five models each have their strengths:

| Model | GDPval-AA | SWE-bench | Coding | Chinese | Best Use Case |
| --- | --- | --- | --- | --- | --- |
| MiMo-V2.5-Pro | 1578 | Medium | Above Average | Average | Agentic Workflows |
| DeepSeek V4 Pro | 1554 | High | High | High | All-Around Balanced |
| GLM 5.1 | 1535 | High | High | High | Tool Calling + Chinese |
| MiniMax M2.7 | 1514 | Medium | Medium | Medium | Multimodal |
| Kimi K2.6 | 1484 | Very High | Very High | High | Code Generation |

Kimi K2.6 ranks last on GDPval-AA but excels on SWE-bench (software engineering benchmark) — this demonstrates that different benchmarks reflect different capability dimensions, and model selection must be scenario-specific rather than score-driven.

Landscape Assessment

May 2026 is China’s “super release month” for open-source models. In addition to the five models above, MiniMax M3 is also on the way. This timing isn’t coincidental — every lab is racing to position its product before Google I/O (mid-May) and Anthropic’s developer conference (May 6).

For developers and enterprise users, this is both a period of "choice overload" and the best window for evaluation:

  • If you need the strongest agentic workflow capability → MiMo-V2.5-Pro is the current pick
  • If you need balanced coding + Chinese + tool capabilities → DeepSeek V4 Pro or GLM 5.1
  • If you focus on software engineering → Kimi K2.6 remains strongest on SWE-bench
  • If you need multimodal capabilities → MiniMax M2.7 deserves testing
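The routing logic in the bullets above can be captured in a small lookup helper. This is an illustrative sketch only: the use-case keys and the `pick_model` function are hypothetical names, while the model recommendations come straight from the comparison table.

```python
# Map use-case categories (hypothetical keys) to the models suggested
# by the cross-benchmark comparison above.
RECOMMENDATIONS = {
    "agentic_workflow": "MiMo-V2.5-Pro",        # top GDPval-AA score
    "balanced": "DeepSeek V4 Pro",              # coding + Chinese + tools
    "tool_calling_chinese": "GLM 5.1",
    "software_engineering": "Kimi K2.6",        # strongest on SWE-bench
    "multimodal": "MiniMax M2.7",
}

def pick_model(use_case: str) -> str:
    """Return the suggested model for a use case, or raise if unknown."""
    try:
        return RECOMMENDATIONS[use_case]
    except KeyError:
        raise ValueError(f"No recommendation for use case: {use_case!r}")

print(pick_model("agentic_workflow"))  # MiMo-V2.5-Pro
```

In practice such a table would also carry per-model metadata (API price, context window, latency) so the choice can be re-weighted as vendors update pricing.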

Action Items

  1. Don’t rely on a single benchmark: GDPval-AA focuses on agentic capability, SWE-bench on coding, LMArena on user feel. Reference the benchmark that matches your actual use case.
  2. Run your own benchmarks: Each model may have uncovered advantages in specific domains. A/B test with your own task set.
  3. Watch the API price war: As model capabilities converge, price becomes the main differentiator. DeepSeek has already initiated API price cuts — others are expected to follow.
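Action item 2 can be sketched as a minimal A/B harness. Everything here is an assumption for illustration: `ab_test`, the stand-in model callables, and the judge function are hypothetical; in real use the callables would wrap each vendor's API and the judge would be a human reviewer or an automated grader.

```python
def ab_test(tasks, model_a, model_b, judge):
    """Compare two model callables on a shared task set.

    model_a / model_b: callables mapping a task to an output.
    judge: callable (task, out_a, out_b) -> "A", "B", or "tie".
    Returns win counts per side.
    """
    wins = {"A": 0, "B": 0, "tie": 0}
    for task in tasks:
        verdict = judge(task, model_a(task), model_b(task))
        wins[verdict] += 1
    return wins

# Toy usage with stand-in callables instead of real model APIs:
tasks = ["summarize report", "fix failing test", "draft email"]
result = ab_test(
    tasks,
    model_a=lambda t: t.upper(),
    model_b=lambda t: t.lower(),
    judge=lambda t, a, b: "A" if len(a) >= len(b) else "B",
)
print(result)
```

The key design point is that both models see the identical task set, so win counts reflect model differences rather than task sampling; blinding the judge to which output came from which model would further reduce bias.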