Chatbot Arena April 2026: Anthropic Sweeps Top Four, Open-Source Gap Narrows

As of late April 2026, the latest LMSYS Chatbot Arena rankings show a clear landscape: Anthropic leads both the text and code tracks, while the open-source camp continues to close the gap.

Text Top 10: Anthropic Takes Four Seats

The Arena text leaderboard top 10 (Elo scores, higher is better):

Rank | Model                    | Score    | Lab
1    | claude-opus-4-7-thinking | 1503 ±8  | Anthropic
2    | claude-opus-4-6-thinking | 1501 ±5  | Anthropic
3    | claude-opus-4-6          | 1496 ±5  | Anthropic
4    | claude-opus-4-7          | 1493 ±7  | Anthropic
5    | gemini-3.1-pro-preview   | 1493 ±5  | Google
6    | muse-spark               | 1489 ±7  | Meta
7    | gpt-5.5-high             | 1488 ±10 | OpenAI
8    | gemini-3-pro             | 1486 ±4  | Google
9    | grok-4.20-beta1          | 1481 ±5  | xAI
10   | gpt-5.4-high             | 1479 ±6  | OpenAI
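To put these point gaps in perspective, the standard Elo expected-score formula converts a rating difference into a head-to-head win probability. A minimal sketch, assuming Arena scores behave like classic Elo ratings (the helper name is mine, not the leaderboard's):

```python
# Convert Arena-style Elo differences into expected head-to-head win rates.
# Scores are taken from the April 2026 text leaderboard above.
def elo_win_prob(score_a: float, score_b: float) -> float:
    """Expected probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((score_b - score_a) / 400))

scores = {
    "claude-opus-4-6-thinking": 1501,
    "gemini-3.1-pro-preview": 1493,
    "gpt-5.5-high": 1488,
}

top = "claude-opus-4-7-thinking"  # 1503 on the text board
for rival, s in scores.items():
    p = elo_win_prob(1503, s)
    print(f"{top} vs {rival}: {p:.1%}")
```

By this reading, even the leader's 15-point edge over gpt-5.5-high translates to only about a 52% expected win rate per matchup, which is why the ± margins matter.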

Four key observations:

Anthropic’s thinking mode shows a clear advantage. claude-opus-4-7-thinking leads at 1503, 10 points above its non-thinking counterpart (1493). The pattern holds on the code leaderboard, where thinking mode reaches 1571, 6 points above the non-thinking 1565, though the margin there is narrower.

OpenAI GPT-5.5 underperforms expectations. gpt-5.5-high ranks seventh at 1488, behind all Claude variants and Gemini 3.1 Pro. The error margin of ±10 is the largest among the top 10, indicating the widest divergence in user evaluations.

Meta’s muse-spark enters the top 6 for the first time. At 1489, it edges past GPT-5.5 and is the highest-ranked model from any lab other than Anthropic or Google. If confirmed as open-source, it would be the strongest open-source text model currently available.

Google’s Gemini pair is stable but lacks a breakthrough. gemini-3.1-pro-preview (1493) and gemini-3-pro (1486) rank fifth and eighth; the small gap between them suggests users perceive little improvement from 3.0 to 3.1 Pro.
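The ± margins above can be read as rough confidence intervals: where two models' intervals overlap, the ranking between them is not clearly separated. A small sketch of that check, using scores from the text table (the helper is illustrative, not Arena's actual methodology):

```python
# Check which adjacent text-leaderboard gaps exceed the reported ± margins.
def intervals_overlap(score_a: float, err_a: float,
                      score_b: float, err_b: float) -> bool:
    """True if the two +/- intervals overlap, i.e. the ranking is not clearly separated."""
    return (score_a - err_a) <= (score_b + err_b) and \
           (score_b - err_b) <= (score_a + err_a)

rows = [
    ("claude-opus-4-7-thinking", 1503, 8),
    ("claude-opus-4-6-thinking", 1501, 5),
    ("gpt-5.5-high", 1488, 10),
    ("gpt-5.4-high", 1479, 6),
]

for (name_a, s_a, e_a), (name_b, s_b, e_b) in zip(rows, rows[1:]):
    status = "overlaps" if intervals_overlap(s_a, e_a, s_b, e_b) else "separated"
    print(f"{name_a} vs {name_b}: {status}")
```

Notably, by this crude test gpt-5.5-high's ±10 interval overlaps both the model above and the model below it, so its exact rank is the least certain in the top 10.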

Code Leaderboard: Anthropic’s Dominance Is Stronger

The code Arena shows an even wider gap:

Rank | Model                         | Score
1    | claude-opus-4-7-thinking      | 1571
2    | claude-opus-4-7               | 1565
3    | claude-opus-4-6-thinking      | 1551
4    | claude-opus-4-6               | 1548
5    | glm-5.1                       | 1534
6    | kimi-k2.6                     | 1529
7    | claude-sonnet-4-6             | 1525
8    | muse-spark                    | 1510
9    | gpt-5.5-high (codex-harness)  | 1500
10   | claude-opus-4-5-thinking-32k  | 1491

Anthropic’s advantage is even more pronounced in code — the top four are all Claude. GLM-5.1 and Kimi-K2.6, at 1534 and 1529 respectively, represent the best performance from Chinese models in the code Arena.

Notably, GPT-5.5 requires the Codex harness to reach 1500 in code, with the standalone version ranking even lower. This suggests that for pure code generation and editing, GPT-5.5 needs additional engineering integration to perform at its best.

Open-Source Progress

Combining Arena data with known open-source status:

  • muse-spark (Meta): If confirmed open-source, its 1489 text score and 1510 code score both exceed GPT-5.5.
  • Xiaomi MiMo-V2.5-Pro: Reached open-source model #1 in text and global sixth, with Agent index #1 among open-source models.
  • GLM-5.1 (Zhipu): Fifth in code Arena at 1534, the highest-ranking Chinese model in code.

The gap between open-source and closed-source #1 has narrowed from 50+ points a year ago to 15-20 points, meaning open-source models are approaching closed-source flagships in real-world usability.
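Under the standard Elo expected-score formula, that narrowing has a concrete head-to-head meaning. A quick sketch, assuming Arena scores behave like classic Elo ratings:

```python
# What an open-vs-closed Elo gap implies for the closed-source leader's
# expected win rate in direct matchups.
def elo_win_prob(score_a: float, score_b: float) -> float:
    """Expected probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((score_b - score_a) / 400))

for gap in (50, 20, 15):
    p = elo_win_prob(gap, 0)  # only the difference matters
    print(f"{gap}-point gap: closed leader expected to win {p:.1%} of matchups")
```

A 50-point gap implies the closed-source leader wins about 57% of matchups; at 15-20 points that drops to roughly 52%, not far from a coin flip.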

Landscape Assessment

The current Arena reflects a tri-polar landscape: Anthropic leads in both text and code, Google maintains a stable second tier with Gemini, and OpenAI’s GPT-5.5 has not reproduced its past dominance in crowdsourced evaluation. In the open-source camp, Meta and Chinese models are closing the gap but remain some distance from fully surpassing closed-source flagships.

For readers: if you need a model stable in both conversation and code, Claude Opus 4.7 remains the top choice. For cost-effectiveness and controllability, Xiaomi MiMo-V2.5-Pro and GLM-5.1 are worth trying.


Main sources: