Long Context Showdown: Whose Million-Token Window Actually Works

Verdict

A million-token context window doesn’t mean “usable at one million tokens.” GPT-5.5 is currently the most reliable model for long-context retrieval (MRCR @ 1M at 74%), DeepSeek V4 and Gemini 2.5 Pro are in the middle (~50-60%), and Claude Opus 4.7 is weak at large windows (32.2%).

If your core need is having the model understand an entire document or large codebase, GPT-5.5 is the most reliable choice today. If you just need a “rough sense of what’s in the document,” any frontier model suffices.

Test Dimensions

Retrieval Accuracy

MRCR (Multi-Reference Context Retrieval) measures a model’s ability to locate key information in ultra-long context. At 1 million tokens:

Model            MRCR @ 1M   Notes
GPT-5.5          74%         Best needle-in-haystack performance
Gemini 2.5 Pro   ~60%        Solid but misses some details
DeepSeek V4      ~50%        Usable, but complex queries lose information
Claude Opus 4.7  32.2%       Significant attention dispersion at large windows

GPT-5.5’s 74% means that, given 10 key data points scattered across a million tokens, it accurately retrieves 7 or 8 of them.
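A needle-in-haystack evaluation of this kind can be sketched in a few lines. The harness below is a simplified illustration, not the actual MRCR benchmark: it plants unique "needle" strings at random positions in filler text, and scores whatever answers a model returns. The model call itself is stubbed out with a placeholder list of answers.

```python
import random

def build_haystack(needles, filler_words=200_000, seed=0):
    """Scatter key facts ("needles") at random positions in filler text."""
    rng = random.Random(seed)
    words = ["lorem"] * filler_words
    positions = sorted(rng.sample(range(filler_words), len(needles)))
    for pos, needle in zip(positions, needles):
        words[pos] = needle
    return " ".join(words), positions

def score_retrieval(answers, needles):
    """Fraction of needles reproduced verbatim somewhere in the answers."""
    found = sum(any(n in a for a in answers) for n in needles)
    return found / len(needles)

needles = [f"SECRET-{i}-{random.Random(i).randint(1000, 9999)}" for i in range(10)]
haystack, positions = build_haystack(needles)

# Stand-in for a real model call; a 74%-MRCR model recovers ~7-8 of 10 needles.
answers = [f"The code is {n}" for n in needles[:7]]
print(score_retrieval(answers, needles))  # 0.7
```

Tracking the planted positions alongside the hits is what makes the decay analysis in the next section possible.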

Context Decay

All models show “decay effects”: information at the beginning and end of the context is retained more easily, while middle sections are often missed:

  • GPT-5.5: Flattest decay curve, middle-section retention significantly better
  • Gemini 2.5 Pro: Strong at start/end, moderate middle decay
  • DeepSeek V4: Uniform decay, accuracy drops linearly with context length
  • Claude Opus 4.7: Long context doesn’t appear to be a training priority; decay is sharp
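The decay curves above can be measured from the same needle tests by binning each needle's relative position into start/middle/end buckets and computing per-bucket accuracy. A minimal sketch, with synthetic results shaped like a "lost in the middle" curve:

```python
def decay_profile(results, n_bins=3):
    """Bin (relative_position, hit) pairs into per-region accuracy.

    relative_position is the needle's offset divided by context length
    (0..1); hit is True if the model retrieved that needle.
    """
    bins = [[] for _ in range(n_bins)]
    for pos, hit in results:
        bins[min(int(pos * n_bins), n_bins - 1)].append(hit)
    return [sum(b) / len(b) if b else None for b in bins]

# Synthetic results: strong at the edges, weak in the middle.
results = [(0.05, True), (0.15, True), (0.45, False), (0.55, True),
           (0.50, False), (0.90, True), (0.95, True), (0.85, True)]
print(decay_profile(results))  # [1.0, 0.333..., 1.0]
```

A flat profile (like GPT-5.5's reported curve) means middle-section retention stays close to edge retention; a U-shaped one indicates sharp middle decay.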

Real-World Scenarios

Technical document QA: Give a model a 500-page API doc, ask about a specific endpoint’s parameters. GPT-5.5 is most accurate; Claude requires splitting the document into segments.
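When a model can't handle the full document reliably, the usual workaround is splitting it into overlapping segments that each fit comfortably inside the window. A minimal sketch, approximating tokens as whitespace-separated words (real code should use the provider's tokenizer; the window and overlap sizes here are illustrative):

```python
def split_for_context(text, max_tokens=150_000, overlap=2_000):
    """Cut a document into overlapping segments that each fit the target
    window; the overlap keeps answers near a boundary from being cut off."""
    words = text.split()
    step = max_tokens - overlap
    segments = []
    for start in range(0, len(words), step):
        segments.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break
    return segments

doc = "word " * 400_000
print(len(split_for_context(doc)))  # 3
```

Each segment is then queried separately and the answers merged, which trades one API call for several but sidesteps middle-of-context decay.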

Codebase review: Feed an entire code repository, ask for potential bugs. GPT-5.5 and Gemini 2.5 Pro provide useful feedback; Claude and DeepSeek V4 often miss cross-file dependency issues.

Long document summarization: All models handle this well — it’s the easiest long-context task since it doesn’t require precise information location.

New Developments

AMD recently published the HyLo (Hybrid Long-context) architecture paper, showing that long-context capability can be added after pretraining at lower cost, with minimal loss of short-context quality. HyLo extends usable context to 2 million tokens. If adopted by mainstream models, this could reshuffle the long-context competitive landscape.

Recommendations

Precise information location in long docs: GPT-5.5. 74% MRCR is currently unmatched.

General understanding only (summarization, sentiment, topics): Any frontier model — pick the cheapest.

Local long-context solutions: Watch for AMD HyLo-based open-source models expected in coming months.

RAG vs Long Context: For “searching specific info across many documents,” traditional RAG remains more reliable than pure long context — segmented retrieval avoids attention decay. Long context is better for “understanding a single large document.”
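The RAG side of this trade-off can be illustrated with a toy retriever. This sketch scores chunks by simple query-term overlap (a stand-in for a real embedding or BM25 index) and sends only the top hits to the model, so no single prompt ever approaches the attention-decay regime:

```python
def retrieve(chunks, query, top_k=3):
    """Score each chunk by query-term overlap and return the best few;
    only these go into the prompt, keeping the effective context small."""
    terms = set(query.lower().split())
    scored = sorted(chunks, key=lambda c: -len(terms & set(c.lower().split())))
    return scored[:top_k]

chunks = [
    "The /users endpoint accepts a limit parameter up to 100.",
    "Authentication uses bearer tokens in the Authorization header.",
    "Rate limits reset every 60 seconds per API key.",
]
hits = retrieve(chunks, "What is the limit parameter of the users endpoint?", top_k=1)
print(hits[0])
```

The cost of this design is exactly what long context avoids: if the answer spans several chunks, the retriever must surface all of them, which is why single-document comprehension still favors a large window.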
