METR's time horizon benchmark has hit the wall.
Not "close to the limit": it has gone straight through it. At the 50% success rate threshold, Claude Mythos Preview can independently complete tasks that would take a skilled human over 16 hours, and 16 hours happens to be the design ceiling of the current benchmark.
In other words, it might be able to go even longer; the ruler is just not long enough anymore.
The Numbers: From 30 Seconds to 16 Hours
METR's core metric is straightforward: how long a task can an AI system complete independently at a 50% success rate, measured by how long it would take a skilled human to do it.
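To make the metric concrete, here is a minimal sketch of one way to estimate a 50% time horizon: fit a logistic curve of success against the log of human task time, then solve for the task length where predicted success is 50%. The task results below are invented for illustration, and the helper names (`fit_logistic`, `horizon_minutes`) are hypothetical; this is not necessarily METR's exact procedure.

```python
import math

# Hypothetical results: (human minutes for the task, 1 = model succeeded, 0 = failed)
results = [
    (1, 1), (1, 1), (4, 1), (4, 1), (15, 1), (15, 0),
    (60, 1), (60, 0), (240, 0), (240, 1), (960, 0), (960, 0),
]
xs = [math.log2(minutes) for minutes, _ in results]  # task length on a log2 scale
ys = [outcome for _, outcome in results]

def fit_logistic(xs, ys, lr=0.05, steps=20000):
    """Fit p(success) = sigmoid(a + b*x) by plain gradient ascent on the log-likelihood."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p
            grad_b += (y - p) * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b

def horizon_minutes(a, b, p=0.5):
    """Task length (in human minutes) at which predicted success equals p."""
    x = (math.log(p / (1.0 - p)) - a) / b  # invert the sigmoid
    return 2.0 ** x                        # undo the log2 transform

a, b = fit_logistic(xs, ys)
h50 = horizon_minutes(a, b)
print(f"Estimated 50% time horizon: about {h50:.0f} human-minutes")
```

The same fitted curve can be read at a stricter threshold, for example `horizon_minutes(a, b, p=0.9)`, which always gives a shorter horizon than the 50% figure when success falls with task length.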
This curve has been rising almost exponentially over the years:
- 2022: GPT-3.5's number was 30 seconds
- 2024: Claude 3.5 Sonnet reached about 1 hour
- Late 2025: Claude Opus 4.6 approached 7-8 hours
- Now: Claude Mythos Preview exceeds 16 hours, the ceiling of the benchmark test
In 18 months, from 1 hour to over 16 hours. That is at least a 16-fold increase in time horizon, or four doublings.
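Taking the article's endpoints at face value (1 hour to 16 hours over 18 months), the implied doubling time can be worked out directly. This is a rough sketch: the variable names are illustrative, and because the latest figure is "over 16 hours," the result is an upper bound on the doubling time rather than a precise estimate.

```python
import math

start_hours = 1.0      # Claude 3.5 Sonnet figure quoted above
end_hours = 16.0       # benchmark ceiling; the true value may be higher
elapsed_months = 18.0  # interval stated in the article

fold_increase = end_hours / start_hours
doublings = math.log2(fold_increase)
doubling_time_months = elapsed_months / doublings

print(f"{fold_increase:.0f}x increase = {doublings:.0f} doublings, "
      f"one doubling every {doubling_time_months:.1f} months")
```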
What 16 Hours Means
16 hours of human work time is roughly a medium-complexity software engineering sprint: building a complete feature module, including requirements analysis, coding, testing, and deployment. Or writing a detailed business plan with market research, financial projections, and competitive analysis.
If AI can do work at this scale without human intervention (note: "independently completed," not a back-and-forth interactive session), then it is no longer an assistant. It is a colleague who does not need a lunch break.
Of course, a 50% success rate means it screws up half the time. But the success rate is itself climbing quickly toward the point of practical use.
Anthropic Co-founder's Prediction
Against this backdrop, recent statements from Anthropic co-founder Dario Amodei are quite interesting. He does not believe AGI will happen in 2026, but he predicts that within a year or two a proof of concept might emerge on non-frontier models: a model that trains its own successor end to end.
"AI building AI" is not science fiction; it is what Amodei thinks might happen in 2027-2028.
The METR data gives this prediction a quantitative anchor. If autonomous task time continues to double at the current rate, 16 hours becomes 32 hours, then 64 hours... at some point, AI could indeed complete ultra-long-chain tasks like "training the next generation of models" without human intervention.
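The doubling argument can be sketched as a simple projection. The 4.5-month doubling time is an assumption derived from the article's own figures (18 months divided by the four doublings from 1 hour to 16 hours), and the function name is hypothetical; a constant doubling rate is itself a strong assumption, not a forecast.

```python
import math

CURRENT_HORIZON_HOURS = 16.0   # benchmark ceiling quoted in the article
ASSUMED_DOUBLING_MONTHS = 4.5  # assumption: 18 months / 4 doublings (1 h -> 16 h)

def months_to_reach(target_hours,
                    current_hours=CURRENT_HORIZON_HOURS,
                    doubling_months=ASSUMED_DOUBLING_MONTHS):
    """Months until the horizon reaches target_hours under constant doubling."""
    return doubling_months * math.log2(target_hours / current_hours)

for target in (32, 64, 128):
    print(f"{target:>4} h: about {months_to_reach(target):.1f} months out")
```

Changing `doubling_months` shows how sensitive the timeline is to the assumed rate: every extra month per doubling pushes each milestone out proportionally.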
But Do Not Take It Too Seriously
A few caveats:
METR's benchmark has limitations. It measures task time span, not task quality. Sixteen hours of code output might be of dubious quality, and sixteen hours of research might be full of holes. A long task does not mean a task done well.
50% success rate is not good enough for engineering. If your CI/CD pipeline has a 50% success rate, nobody will use it. For autonomous tasks to go from "occasionally usable" to "reliable tool," the success rate needs to reach at least 90%.
Anthropic's own Mythos is still in Preview. The official version has not been released yet, and all data comes from early preview builds. The official release might be more capable, or it might be weakened by safety alignment.
Main sources: