When selecting AI models, many teams compare listed API prices per million tokens. But Stanford CRFM’s latest research reveals a serious flaw in that approach: a model with a cheaper list price can actually cost dozens of times more to run.
## Stanford’s 28x Reversal
The research team found:
- Gemini 3 Flash listed price: 1.7x cheaper than Claude Haiku 4.5
- Gemini 3 Flash actual cost (same task): 28x more expensive than Claude Haiku 4.5
Two core reasons:
- Token efficiency differences: some models need more rounds and longer outputs to work through complex questions, so they consume far more tokens per task
- Task completion rate: if a model can’t answer correctly on the first try, retry costs accumulate rapidly
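The interaction of these two factors can be sketched with simple arithmetic. The function and all numbers below are hypothetical illustrations, not figures from the study: under independent retries with success probability p, the expected number of attempts per solved task is 1/p, so effective cost is (price × tokens per attempt) / p.

```python
def expected_cost_per_solved_task(price_per_mtok: float,
                                  tokens_per_attempt: float,
                                  success_rate: float) -> float:
    """Expected dollar cost to get one correct answer.

    Assumes independent retries, so expected attempts = 1 / success_rate
    (geometric distribution). All inputs here are illustrative.
    """
    cost_per_attempt = price_per_mtok * tokens_per_attempt / 1_000_000
    return cost_per_attempt / success_rate

# A model with a cheap list price but verbose outputs and a low success rate...
cheap_listed = expected_cost_per_solved_task(
    price_per_mtok=0.10, tokens_per_attempt=50_000, success_rate=0.30)

# ...versus a pricier-per-token model that answers tersely and reliably.
pricier_listed = expected_cost_per_solved_task(
    price_per_mtok=0.80, tokens_per_attempt=4_000, success_rate=0.90)

print(f"cheap list price:   ${cheap_listed:.4f} per solved task")
print(f"pricier list price: ${pricier_listed:.4f} per solved task")
```

With these made-up inputs, the model that is 8x cheaper per token ends up several times more expensive per solved task, which is the same reversal mechanism the study describes at larger scale.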
The team estimates that about 20% of pairwise model cost rankings reverse depending on which benchmark is used.
## Artificial Analysis Index Data
The latest cost data, as of April 25:
| Model | Total Evaluation Cost |
|---|---|
| Claude Opus 4.7 | $4,811 |
| Sonnet 4.6 | $3,959 |
| GPT-5.5 (xhigh) | $3,357 |
| GPT-5.5 (high) | $2,159 |
| GPT-5.5 (medium) | $1,199 |
| DeepSeek V4 Pro | $1,071 |