Claude Mythos METR Evaluation: Autonomous Task Time Doubles Past 16 Hours, The Watershed Moment from Assistant to Independent Worker

METR's time horizon benchmark has hit the wall.

Not "close to the limit": it went straight through it. At the 50% success-rate threshold, Claude Mythos Preview can independently complete tasks that would take a skilled human more than 16 hours, and 16 hours happens to be the design ceiling of the current benchmark.

In other words, it might be able to go even longer; the ruler is just not long enough anymore.

The Numbers: From 30 Seconds to 16 Hours

METR's core metric is straightforward: how long a task can an AI system complete independently at a 50% success rate, measured by how long it would take a skilled human to do it.
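
To make the metric concrete, one simplification is to assume that success probability falls along a logistic curve in the logarithm of human task time. The sketch below is illustrative only: the function names and the parameters `a` and `b` are hypothetical, chosen so that the 50% horizon lands on 16 hours, and are not METR's fitted values.

```python
import math

def success_probability(hours, a, b):
    """Assumed model: P(success) = sigmoid(a - b * log2(hours))."""
    return 1.0 / (1.0 + math.exp(-(a - b * math.log2(hours))))

def time_horizon(p, a, b):
    """Human task length (hours) at which the assumed model succeeds with probability p."""
    logit = math.log(p / (1.0 - p))
    return 2.0 ** ((a - logit) / b)

# Hypothetical parameters, picked so the 50% horizon is exactly 16 hours.
A, B = 4.0, 1.0
h50 = time_horizon(0.5, A, B)
h80 = time_horizon(0.8, A, B)
print(f"50% horizon: {h50:.1f} h, 80% horizon: {h80:.1f} h")
```

Under these made-up parameters the 80% horizon comes out well under half of the 50% horizon, which is why the choice of success threshold matters when reading the headline number.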

This curve has been rising almost exponentially over the years:

2022: GPT-3.5's number was 30 seconds
2024: Claude 3.5 Sonnet reached about 1 hour
Late 2025: Claude Opus 4.6 approached 7-8 hours
Now: Claude Mythos Preview exceeds 16 hours, the ceiling of the benchmark test

In 18 months, the time horizon went from 1 hour to more than 16 hours: at least a 16-fold increase, or four doublings.
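
The arithmetic behind that jump can be checked in a few lines. This is a minimal sketch using only the figures quoted above (1 hour, 16 hours, 18 months); the helper names are made up for illustration, and because 16 hours is a floor rather than a measured value, the implied doubling time is an upper bound.

```python
import math

def doublings(start_hours, end_hours):
    """Number of times the horizon doubled between two measurements."""
    return math.log2(end_hours / start_hours)

def doubling_time_months(start_hours, end_hours, elapsed_months):
    """Average months per doubling over the interval."""
    return elapsed_months / doublings(start_hours, end_hours)

fold = 16.0 / 1.0                           # growth factor from 1 h to 16 h
n = doublings(1.0, 16.0)                    # doublings needed for that factor
dt = doubling_time_months(1.0, 16.0, 18.0)  # months per doubling

print(f"{fold:.0f}x growth = {n:.0f} doublings, one every {dt:.1f} months or faster")
```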

What 16 Hours Means

16 hours of human work time is roughly a medium-complexity software engineering sprint: building a complete feature module, including requirements analysis, coding, testing, and deployment. Or writing a detailed business plan with market research, financial projections, and competitive analysis.

If AI can do work of this scope without human intervention (note: "independently completed," not a back-and-forth interactive session), then it is no longer an assistant. It is a colleague who does not need a lunch break.

Of course, a 50% success rate means it screws up half the time. But that success rate, too, is rapidly approaching the threshold of practical usefulness.

Anthropic Co-founder's Prediction

Against this backdrop, recent statements from Anthropic co-founder Dario Amodei are quite interesting. He does not believe AGI will happen in 2026, but predicts that within a year or two, a proof of concept might emerge on non-frontier models: a model that end-to-end trains its own successor.

"AI building AI" is not science fiction; it is what Amodei thinks might happen in 2027-2028.

The METR data gives this prediction a quantitative anchor. If autonomous task time continues to double at the current rate, 16 hours becomes 32 hours, then 64 hours... at some point, AI could indeed complete ultra-long-chain tasks like "training the next generation of models" without human intervention.
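
The extrapolation can be sketched as a simple exponential projection. Everything here rests on assumptions: the 4.5-month doubling period is derived from the article's own "1 hour to 16 hours in 18 months" figure, the function names are hypothetical, and nothing guarantees that the trend continues.

```python
import math

def project_horizon(current_hours, doubling_months, months_ahead):
    """Horizon after months_ahead, assuming a constant doubling period."""
    return current_hours * 2.0 ** (months_ahead / doubling_months)

def months_until(current_hours, target_hours, doubling_months):
    """Months needed to reach target_hours under the same assumption."""
    return doubling_months * math.log2(target_hours / current_hours)

DOUBLING_MONTHS = 4.5  # assumption: 4 doublings over 18 months

for months in (4.5, 9.0):
    hours = project_horizon(16.0, DOUBLING_MONTHS, months)
    print(f"after {months} months: {hours:.0f} hours")

print(f"months to reach 64 hours: {months_until(16.0, 64.0, DOUBLING_MONTHS):.1f}")
```

A slower or faster doubling period shifts these dates proportionally, so the projection is only as good as the assumed rate.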

But Do Not Take It Too Seriously

A few caveats:

METR's benchmark has limitations. It measures how long a task is, not how good the result is. Sixteen hours' worth of code might be of dubious quality, and sixteen hours' worth of research might be full of holes. Long does not mean well done.

50% success rate is not good enough for engineering. If your CI/CD pipeline has a 50% success rate, nobody will use it. For autonomous tasks to go from "occasionally usable" to "reliable tool," the success rate needs to reach at least 90%.

Anthropic's own Mythos is still in Preview. The official version has not been released yet, and all data comes from early preview builds. The official release might be more capable, or it might be weakened by safety alignment.
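
On the 50% versus 90% caveat above, a rough way to see the gap is to ask how many independent attempts it would take to get at least one success. This is a sketch under strong assumptions: attempts are treated as statistically independent, and there is assumed to be a reliable way to tell which attempt succeeded (for example, a test suite). Neither is guaranteed in practice, and the function names are made up for illustration.

```python
import math

def p_any_success(p, attempts):
    """Probability that at least one of several independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** attempts

def attempts_needed(p, target):
    """Smallest number of independent attempts whose combined success rate reaches target."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))

k = attempts_needed(0.5, 0.9)
print(f"attempts needed at p=0.5 for 90%: {k}")
print(f"combined success rate with {k} attempts: {p_any_success(0.5, k):.4f}")
```

Retrying multiplies the compute cost by the number of attempts and only helps when failures can be detected, so this does not replace a model that succeeds 90% of the time on its own.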

