## Verdict
A million-token context window doesn’t mean “usable at one million tokens.” GPT-5.5 is currently the most reliable model for long-context retrieval (MRCR @ 1M at 74%), DeepSeek V4 and Gemini 2.5 Pro are in the middle (~50-60%), and Claude Opus 4.7 is weak at large windows (32.2%).
If your core need is having the model understand an entire document or large codebase, GPT-5.5 is the most reliable choice today. If you just need a “rough sense of what’s in the document,” any frontier model suffices.
## Test Dimensions

### Retrieval Accuracy
MRCR (Multi-round Co-reference Resolution) measures a model's ability to locate key information in ultra-long context. At 1 million tokens:
| Model | MRCR @ 1M | Notes |
|---|---|---|
| GPT-5.5 | 74% | Best needle-in-haystack performance |
| Gemini 2.5 Pro | ~60% | Solid but misses some details |
| DeepSeek V4 | ~50% | Usable but complex queries lose information |
| Claude Opus 4.7 | 32.2% | Significant attention dispersion at large windows |
GPT-5.5's 74% means that of every 10 key data points scattered across a million tokens, it accurately locates roughly 7-8.
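Benchmarks of this shape are essentially needle-in-a-haystack tests: plant known facts at random depths in filler text, ask the model to retrieve them, and score exact recall. A minimal harness sketch — `build_haystack` and `score_recall` are illustrative names, not from any published MRCR implementation, and the model call itself is omitted:

```python
import random

def build_haystack(needles, filler_sentences=1000, seed=0):
    """Plant each needle sentence at a random depth in filler text.

    Returns the assembled document plus the fractional depth (0.0-1.0)
    of each planted needle, for later decay analysis.
    """
    rng = random.Random(seed)
    doc = ["The quick brown fox jumps over the lazy dog."] * filler_sentences
    # Insert from the back so earlier insertion points stay valid.
    positions = sorted((rng.randrange(filler_sentences) for _ in needles),
                       reverse=True)
    for pos, needle in zip(positions, needles):
        doc.insert(pos, needle)
    depths = [p / filler_sentences for p in sorted(positions)]
    return " ".join(doc), depths

def score_recall(needle_values, model_answer):
    """Fraction of planted values that appear verbatim in the model's answer."""
    found = sum(1 for v in needle_values if v in model_answer)
    return found / len(needle_values)
```

For example, `score_recall(["7421", "Vienna", "blue"], "The code is 7421 and the city is Vienna")` returns 2/3; in a real run the answer string would come from the model under test.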
### Context Decay
All models show positional decay: information near the beginning and end of the context is retained more reliably, while middle sections are often missed (the well-documented "lost in the middle" effect):
- GPT-5.5: Flattest decay curve, middle-section retention significantly better
- Gemini 2.5 Pro: Strong at start/end, moderate middle decay
- DeepSeek V4: Uniform decay, accuracy drops linearly with context length
- Claude Opus 4.7: Long context doesn’t appear to be a training priority; decay is sharp
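Decay curves like these can be produced by bucketing per-needle results by depth. A sketch, assuming each record is a `(depth, hit)` pair from a harness run (`decay_by_position` is an illustrative name, not a standard API):

```python
def decay_by_position(results, buckets=("start", "middle", "end")):
    """Group (depth, hit) records into thirds of the context and
    compute per-bucket retrieval accuracy.

    `results` is a list of (depth, hit) pairs: depth is the fractional
    position (0.0-1.0) of the planted fact, hit is whether the model
    retrieved it.
    """
    counts = {b: [0, 0] for b in buckets}  # bucket -> [hits, total]
    for depth, hit in results:
        bucket = buckets[min(int(depth * len(buckets)), len(buckets) - 1)]
        counts[bucket][0] += int(hit)
        counts[bucket][1] += 1
    return {b: (h / t if t else None) for b, (h, t) in counts.items()}
```

For instance, `decay_by_position([(0.1, True), (0.5, False), (0.55, True), (0.9, True)])` yields `{"start": 1.0, "middle": 0.5, "end": 1.0}` — the U-shaped pattern described above.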
## Real-World Scenarios
**Technical document QA:** Give a model a 500-page API doc and ask about a specific endpoint's parameters. GPT-5.5 is the most accurate; Claude requires splitting the document into segments.
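The document-splitting workaround is simple to implement. A sketch of fixed-size chunking with overlap — sizes are in characters for brevity; a production pipeline would count tokens with the model's tokenizer:

```python
def chunk_text(text, chunk_size=2000, overlap=200):
    """Split text into fixed-size chunks; consecutive chunks share
    `overlap` characters so a fact at a boundary appears in both."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```

Each chunk is then sent to the model separately and the per-chunk answers merged, trading one long call for several short, more reliable ones.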
**Codebase review:** Feed in an entire repository and ask for potential bugs. GPT-5.5 and Gemini 2.5 Pro provide useful feedback; Claude and DeepSeek V4 often miss cross-file dependency issues.

**Long-document summarization:** All models handle this well; it's the easiest long-context task because it doesn't require precisely locating information.
## New Developments
AMD recently published the HyLo (Hybrid Long-context) architecture paper, demonstrating that long-context capability can be added after pretraining at lower cost and with minimal loss of short-context quality. HyLo extends usable context to 2 million tokens. If mainstream models adopt it, the long-context competitive landscape could reshuffle.
## Recommendations
**Precise information location in long documents:** GPT-5.5. Its 74% MRCR score is currently unmatched.

**General understanding only (summarization, sentiment, topics):** Any frontier model; pick the cheapest.

**Local long-context solutions:** Watch for AMD HyLo-based open-source models expected in the coming months.

**RAG vs. long context:** For searching for specific information across many documents, traditional RAG remains more reliable than pure long context, because segmented retrieval sidesteps attention decay. Long context is better suited to understanding a single large document.
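To make the contrast concrete, segmented retrieval can be sketched with a toy scorer: rank chunks by term overlap with the query and pass only the top few to the model. Real RAG stacks would use BM25 or embeddings; plain term overlap just keeps the sketch self-contained:

```python
def top_chunks(query, chunks, k=3):
    """Rank chunks by shared-term count with the query (a toy stand-in
    for BM25 or embedding similarity) and return the best k."""
    q_terms = set(query.lower().split())
    scored = sorted(
        chunks,
        key=lambda c: len(q_terms & set(c.lower().split())),
        reverse=True,
    )
    return scored[:k]
```

Because only the top-k chunks ever reach the model, its attention never has to span the full corpus — which is exactly why segmented retrieval avoids the decay effects measured above.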