Oxford/LLNL Chain-of-Thought Benchmark: GPT 95.7% Single, Collapses to 9.83% Chained

Conclusion

Oxford University and Lawrence Livermore National Laboratory (LLNL) jointly published a benchmark study on long-horizon chain-of-thought reasoning. Using GPT 5.2 as the test subject, the study found that the model achieves 95.7% accuracy on individual problems, but when those same problems are chained into multi-step tasks, accuracy collapses to 9.83%.

This result reveals a core bottleneck of current AI models: strong individual capabilities, but system-level failure due to error accumulation in multi-step chains. The research team notes this is not a problem that can be fixed through simple optimization.

Test Dimensions

Benchmark Design

The research team selected a set of problems that GPT 5.2 can solve independently with 95.7% accuracy. They then organized these problems into a chain requiring sequential completion — where each step’s output serves as the next step’s input.

Result: when these high-accuracy individual tasks were chained together, overall accuracy dropped to 9.83%. Near-perfect capability almost completely fails in multi-step scenarios.

Error Cascading Effect

The accuracy collapse from 95.7% to 9.83% stems from cascading error amplification:

  • Even a 4.3% error rate in the first step contaminates the input for all subsequent steps
  • As the chain grows, per-step accuracies multiply, so the compound failure rate climbs steeply (a simplified model is sketched after this list)
  • The model cannot “self-check” or “self-correct” at intermediate steps
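
To make the numbers concrete, here is a minimal sketch of the standard independent-error model: if each step succeeds with probability p and the chain only succeeds when every step does, chain accuracy decays as p^n. The 95.7% per-step figure comes from the study; the independence assumption and the step counts below are illustrative only, not the benchmark’s actual methodology.

```python
# Simplified independent-error model of chained accuracy (illustrative only).
def chain_accuracy(per_step_accuracy: float, num_steps: int) -> float:
    """Probability that every step succeeds, assuming independent errors
    and no recovery once a step goes wrong."""
    return per_step_accuracy ** num_steps

per_step = 0.957  # single-task accuracy reported in the study

for n in (1, 5, 10, 25, 50):
    print(f"{n:>2} steps -> chain accuracy {chain_accuracy(per_step, n):.3f}")

# Output:
#  1 steps -> chain accuracy 0.957
#  5 steps -> chain accuracy 0.803
# 10 steps -> chain accuracy 0.644
# 25 steps -> chain accuracy 0.333
# 50 steps -> chain accuracy 0.111
```

Under this toy model, a 4.3% per-step error rate is already enough to push a few dozen chained steps into the ~10% range, which matches the qualitative shape of the reported collapse even though the study’s real error dynamics (contamination, distribution shift) are more complicated than independent failures.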

Why It “Cannot Be Fixed”

The research team identifies three core reasons:

  1. Self-attention limitations: When processing long chains, the Transformer architecture’s attention weights dilute early-step information at later steps
  2. Lack of intermediate verification: The model does not actively verify output correctness after each step; it passes results directly to the next step (contrasted in the sketch after this list)
  3. Distribution shift: Even with low per-step error rates, the input distribution rapidly diverges from training data distribution after multi-step chaining
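
As a hypothetical illustration of the second point, the difference between today’s pass-through behavior and an explicitly verified hand-off might look like the sketch below; the step and verify functions are invented placeholders, not anything described in the study.

```python
from typing import Callable

def run_chain_unchecked(steps: list[Callable[[str], str]], task: str) -> str:
    """Pass-through chaining: each output is handed on as-is, so an early
    error silently contaminates every later step."""
    state = task
    for step in steps:
        state = step(state)          # no check before the hand-off
    return state

def run_chain_checked(steps: list[Callable[[str], str]],
                      verify: Callable[[str], bool],
                      task: str) -> str:
    """Chaining with intermediate verification: gate each hand-off on an
    explicit check and stop early instead of propagating a suspect result."""
    state = task
    for i, step in enumerate(steps):
        state = step(state)
        if not verify(state):        # the check current models skip
            raise ValueError(f"step {i} produced an unverifiable result")
    return state
```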

Implications for Practical Applications

| Scenario | Risk Level | Explanation |
| --- | --- | --- |
| Single Q&A / analysis | Low | Individual task accuracy remains very high |
| Multi-step workflows | High | The longer the chain, the higher the overall failure rate |
| Autonomous agents | Very high | Agents are essentially long reasoning chains and need additional error-recovery mechanisms |
| Scientific discovery pipelines | High | Multi-stage research processes require human intervention at key nodes |

Selection Guidance

  • Single-task scenarios: Current models are sufficient — 95.7% accuracy is acceptable in most contexts
  • Multi-step workflows: Add human review or cross-validation at critical nodes; do not fully rely on automatic chaining
  • Agent development: Must include error detection and fallback mechanisms; do not assume chains will execute smoothly end-to-end (a sketch follows this list)
  • Research/engineering decisions: Understand the “chain collapse” property of models and set checkpoints in critical processes
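
A minimal sketch of that guidance, assuming invented helpers (call_model, looks_valid, ask_human) that stand in for whatever model API, validator, and review channel a real system would use:

```python
# Chained workflow with error detection, bounded retries, and a human
# fallback at critical nodes. All helper functions are hypothetical.

def call_model(prompt: str) -> str:
    """Placeholder for a real model API call."""
    raise NotImplementedError

def looks_valid(output: str) -> bool:
    """Placeholder validator: schema check, unit test, cross-check, etc."""
    raise NotImplementedError

def ask_human(step_name: str, output: str) -> str:
    """Placeholder human-review checkpoint for critical nodes."""
    raise NotImplementedError

def run_step(step_name: str, prompt: str, critical: bool, max_retries: int = 2) -> str:
    """Run one step with detection, retries, and escalation instead of
    silently passing a suspect output to the next step."""
    output = ""
    for _ in range(max_retries + 1):
        output = call_model(prompt)
        if looks_valid(output):
            # Critical nodes still get human sign-off even when checks pass.
            return ask_human(step_name, output) if critical else output
    # Detection kept failing: escalate rather than contaminate the chain.
    return ask_human(step_name, output)

def run_workflow(task: str) -> str:
    """Explicit chain where every hand-off goes through run_step."""
    plan   = run_step("plan",   f"Plan how to solve: {task}",    critical=False)
    draft  = run_step("draft",  f"Carry out this plan:\n{plan}", critical=False)
    answer = run_step("answer", f"Finalize:\n{draft}",           critical=True)
    return answer
```

The point of the sketch is structural: every hand-off is guarded, and failure becomes an explicit escalation to a human rather than a silent cascade down the chain.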

Primary Sources