GPT-5.5, Claude Opus 4.7, Gemini 3.1 Within 3 Points: Has the Frontier Model Intelligence Ceiling Arrived?

Take a look at Artificial Analysis's latest model intelligence index and you'll notice something we haven't seen before:

GPT-5.5 (xhigh) scores 60, Claude Opus 4.7 (max) scores 57, and Gemini 3.1 Pro Preview also scores 57. Kimi K2.6 and MiMo-V2.5-Pro each sit at 54.

The top three are separated by just 3 points. Once you factor in measurement error and benchmark variance, that is essentially a statistical dead heat.
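To make the "dead heat" idea concrete, here is a minimal sketch that checks whether score gaps fall inside an uncertainty band. The scores are the ones quoted above; the +/- 2 point margin is purely an assumption for illustration (no error bars are quoted here), and overlapping intervals are a crude check rather than a formal significance test.

```python
# Index scores quoted in the article.
scores = {
    "GPT-5.5 (xhigh)": 60,
    "Claude Opus 4.7 (max)": 57,
    "Gemini 3.1 Pro Preview": 57,
    "Kimi K2.6": 54,
    "MiMo-V2.5-Pro": 54,
}

# ASSUMPTION: a hypothetical +/- 2 point uncertainty on every score.
# This is an illustrative value, not a published figure.
ASSUMED_MARGIN = 2.0


def within_noise(score_a: float, score_b: float, margin: float) -> bool:
    """True when the intervals [score - margin, score + margin] overlap."""
    # Two equal-width intervals overlap exactly when the gap between
    # their centers is at most twice the margin.
    return abs(score_a - score_b) <= 2 * margin


leader = max(scores, key=scores.get)
tied_with_leader = [
    name
    for name, score in scores.items()
    if within_noise(scores[leader], score, ASSUMED_MARGIN)
]
print(leader, tied_with_leader)
```

Under this assumed margin the top three overlap while the 54-point models do not. Shrink the margin to 1 point and GPT-5.5 stands alone, so the conclusion depends heavily on how noisy you believe the index is.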

This would have been unthinkable a year ago. Back then, each generational leap, whether from GPT-4 to GPT-4.5 or from Claude 3 to Claude 4, brought a double-digit score difference. Now? Everyone's packed into a very narrow band.

What does this mean?

The first layer is straightforward: the "absolute intelligence" growth of frontier models is decelerating. Not stopping, but decelerating. When all the top players have access to similar training data, similar compute scale, similar architectures (Transformer + MoE + RLHF/RLVR), marginal improvements naturally get smaller.

The second layer is more interesting: the logic for choosing models is undergoing a fundamental shift.

If intelligence is roughly the same, the deciding factors shift elsewhere:

Speed: Mercury 2 runs at 905 tokens/s, while frontier reasoning models might only manage 20-30 tokens/s. For most everyday tasks, the speed difference has far more impact than that 3-point intelligence gap.
Price: GPT-5.5 (xhigh) costs thousands of times more than Qwen3.5 0.8B. If a small model handles 90% of your tasks, why pay 50x more for the remaining 10%?
Context window: Llama 4 Scout has a 10 million token context window, while most frontier reasoning models are still in the hundreds of thousands to low millions range. For long document processing, this difference is qualitative.
Tool calling and Agent capabilities: These aren't in the "intelligence index," but their impact on actual workflows might be even greater.
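The price argument in the list above can be sketched as a simple cascade: send every request to a cheap model first and escalate only the failures to an expensive one. The 90% success rate and the 50x price ratio come from the Price item; the function name, the unit costs, and the assumption that failures are detected for free are all hypothetical.

```python
def cascade_cost(small_cost: float, large_cost: float, small_success_rate: float) -> float:
    """Expected cost per request when the small model is always tried first."""
    # Every request pays for the small model. The fraction that the small
    # model fails on additionally pays for the large model.
    escalation_rate = 1.0 - small_success_rate
    return small_cost + escalation_rate * large_cost


# ASSUMPTION: costs are in arbitrary units, with the small model as 1 unit.
SMALL_COST = 1.0
LARGE_COST = 50.0          # "50x more", as in the Price item
SMALL_SUCCESS_RATE = 0.90  # "handles 90%", as in the Price item

blended = cascade_cost(SMALL_COST, LARGE_COST, SMALL_SUCCESS_RATE)
savings_factor = LARGE_COST / blended

print(f"all-large cost per request: {LARGE_COST:.2f}")
print(f"cascade cost per request:   {blended:.2f}")
print(f"cascade is about {savings_factor:.1f}x cheaper")
```

Under these assumptions the cascade costs roughly 6 units per request against 50 for sending everything to the large model. The savings shrink quickly if the small model's real success rate is lower, or if detecting its failures is itself expensive.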

I'm not saying frontier models don't matter. When your task is "solve a math problem nobody has solved before" or "analyze a 500-page legal document to find hidden clauses," those extra 3 points might be the difference between a model that can do it and one that can't. But for the vast majority of application scenarios (coding, writing, data analysis, customer service), users probably won't even feel the difference between a 54-point model and a 60-point model.

Model companies need the "we're the strongest" narrative to sustain valuations and pricing. But users' actual needs don't require that narrative. Users need "good enough and cheap."

This also explains why the Qwen3.5 series dominates the speed and price charts even though its intelligence index is only in the low 30s. For a large number of tasks, low 30s is enough, and it comes with a speed of 905 tokens/s and a price of $0.02/M tokens. That cost-performance combination is far more attractive than "60 points but 30x slower and 1000x more expensive."
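A rough back-of-the-envelope sketch shows what "30x slower and 1000x more expensive" means for a real workload. The 905 tokens/s and $0.02/M figures are the ones quoted above; the frontier-model numbers are derived from the 30x and 1000x ratios rather than taken from any price page, and the 10 million token workload and strictly sequential generation are assumptions for illustration.

```python
def job_estimate(tokens: int, tokens_per_second: float, usd_per_million_tokens: float):
    """Return (seconds, dollars) to generate `tokens` sequentially."""
    seconds = tokens / tokens_per_second
    usd = tokens / 1_000_000 * usd_per_million_tokens
    return seconds, usd


# Figures quoted in the article for the fast, cheap option.
FAST_TPS = 905.0
FAST_USD_PER_M = 0.02

# ASSUMPTION: derived from the article's "30x slower, 1000x more expensive",
# not from a vendor price page.
SLOW_TPS = FAST_TPS / 30
SLOW_USD_PER_M = FAST_USD_PER_M * 1000

# ASSUMPTION: a hypothetical workload of 10 million generated tokens.
WORKLOAD_TOKENS = 10_000_000

fast_seconds, fast_usd = job_estimate(WORKLOAD_TOKENS, FAST_TPS, FAST_USD_PER_M)
slow_seconds, slow_usd = job_estimate(WORKLOAD_TOKENS, SLOW_TPS, SLOW_USD_PER_M)

print(f"fast model:     {fast_seconds / 3600:.1f} h, ${fast_usd:.2f}")
print(f"frontier model: {slow_seconds / 3600:.1f} h, ${slow_usd:.2f}")
```

On this hypothetical workload the cheap model finishes in about three hours for around twenty cents, while the frontier model needs close to four days and about $200. Parallel requests would cut the wall-clock time for both, but the cost ratio stays the same.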

Over the next 6-12 months, I expect to see more "scenario-specialized" models emerge: not chasing general intelligence scores, but being the best in their specific lane. Code models, legal models, medical models, multilingual models: each taking first place in its own lane, rather than trying to squeeze out 2 more points on a general leaderboard.

This isn't a decline in model capability; it's market maturation. When technical differences narrow, competition naturally shifts to engineering efficiency, cost control, and scenario fit.

Sources:

Artificial Analysis: Model Comparison
Model pricing and speed data from official vendor price pages
