Discussion about this post

Adam

> The METR graph cannot be saved

I think this is not a helpful framing at all! Instead the question should be: how should and shouldn't I update my views based on the graph? Clearly it contains a bunch of signal, so the correct amount to update is not zero, and just as clearly you need to think about how that signal generalises to other claims you care about.

I think arguments like these about how well it generalises are most useful when they come with concrete predictions about exactly what will and won't generalise. We're working against the clock of rapid AI progress here; we can't pre-empt everything on the first shot, so it's probably most valuable to do a good first version and then work to resolve the most important and decision-relevant differences in people's predictions based on the results of that first version (and the state of play more generally).

On the release of this paper, some commentators (e.g. Gary Marcus, IIRC) said they expected that the exponential time-horizon trend wouldn't generalise to other kinds of tasks outside of software engineering. I think (at least on some operationalisations) that claim was probably wrong, based on METR's follow-up investigating that question: https://metr.org/blog/2025-07-14-how-does-time-horizon-vary-across-domains

In this post, I'm glad to see you predict that the baseliners on their initial suite took longer than a different sample would if incentivised differently. It might be helpful to write up a few of these predictions, and take them to METR (or your favourite non-METR research group that wants to do a replication) and see if we can get evidence on those too.

Victualis

Your critique is reasonable, and I agree with much of it. However, the interesting part of the graph really isn't the y-axis, whose values, as you point out, are pulled out of a sack marked "METR buddies doing random tasks, using weird metrics and incentives". The graph is getting attention because it shows a nice linear pattern on a plot with a linear horizontal axis and a logarithmic vertical axis, and maybe an even steeper recent trend. This supports claims that LLM progress is continuing, that capabilities are advancing something like exponentially, and perhaps that the rate of improvement has recently increased. The exact parameters of the curve are close to irrelevant if this really is exponential growth. Peter Thiel could hypothetically order 1000 Palantir coders to independently implement some difficult task related to their codebase, and use those timings as a more rigorous baseline, but the belief is that LLMs would show the same pattern on such a dataset.
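To make that concrete, here is a minimal sketch, with made-up numbers rather than METR's actual measurements, of why a straight line on a log-scale vertical axis is exactly the exponential-growth claim, and why the slope (the doubling time) survives even if every y-value is off by a constant factor:

```python
# Minimal sketch: a straight line on a semi-log plot is the claim
# "time horizon grows exponentially with calendar time".
# The data points below are purely hypothetical, not METR's numbers.
import numpy as np

# Hypothetical (release_year, time_horizon_minutes) pairs.
years = np.array([2019.5, 2020.5, 2022.0, 2023.0, 2024.0, 2025.0])
horizons = np.array([0.2, 1.0, 4.0, 15.0, 60.0, 240.0])

# Fit log2(horizon) = a * year + b.
a, b = np.polyfit(years, np.log2(horizons), 1)

doubling_time_months = 12.0 / a  # months per doubling of the horizon
print(f"Doubling time ~ {doubling_time_months:.1f} months")

# Rescaling every horizon by a constant factor (e.g. if the baseliners
# were systematically slow) shifts the intercept b but leaves the slope
# a, and hence the doubling time, unchanged.
```

That last point is why the relative-improvement reading is more robust than any absolute-capability reading of the same graph.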

I think there is a strong argument to be made that the fixed 50% threshold is a weakness. It's also a problem that the benchmark assumes that if a system has mastered some set of tasks at the 50% level, then it has also mastered all shorter-duration tasks at 50% or better (which might not hold: RL that favours quick solutions to longer tasks might hurt correctness on shorter tasks). I can also accept that it's possible for the initial duration labels to be adversarially selected to facilitate apparent exponential progress (a nontrivial combinatorial problem, but maybe worth investigating). I fully agree with you that trying to say anything about the absolute capability levels of LLMs based on the METR 50% graph is silly. But I'm not convinced your specific complaints undermine the main way the graph has been interpreted, as a claim about relative improvement over time.
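For what it's worth, a 50% time horizon is usually read off a fitted success-vs-duration curve rather than observed directly. Here's a minimal sketch of that kind of fit, using hypothetical task outcomes and assuming a simple logistic model in log duration (roughly the shape of fit I understand METR to use). Note that the monotonicity assumption questioned above is baked into this model by construction, since the fitted curve is forced to be monotone in duration:

```python
# Minimal sketch of estimating a "50% time horizon": regress
# success/failure against log(task duration) with a logistic model,
# then read off the duration at which predicted success = 0.5.
# The per-task outcomes below are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-task data: human completion time (minutes) and
# whether the model solved the task.
task_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480])
solved = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0])

X = np.log2(task_minutes).reshape(-1, 1)
clf = LogisticRegression().fit(X, solved)

# P(success) = 0.5 where the linear predictor is zero:
# coef * log2(t) + intercept = 0  =>  t = 2 ** (-intercept / coef).
t50 = 2 ** (-clf.intercept_[0] / clf.coef_[0][0])
print(f"Estimated 50% time horizon ~ {t50:.0f} minutes")

# Because the model is logistic in duration, the fitted success
# probability can only change monotonically with task length; it
# cannot represent a system that does worse on shorter tasks.
```

So a non-monotone capability profile of the kind you describe would be invisible in the horizon number itself, and would only show up if someone inspected the per-task residuals.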

