It's complicated. Opus 4.5 is actually not that good at the 80% completion threshold, but it is above the other models at the 50% threshold. I read there's a single task of around 16 h that the model completed, and the broad CI comes largely from that.
METR currently simply runs out of tasks in the 10-20 h range, and as a result you have a small N and lots of uncertainty there. (They fit a logistic to the discrete 0/1 results to get the thresholds you see in the graph.) They need new, longer tasks; then we'll know better.
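A minimal sketch of that kind of fit, assuming the methodology is roughly "regress binary success/failure on log task length, then read off where the curve crosses 50% or 80%". The data below is made up for illustration (including one completed ~16 h task), not METR's actual results, and this is not their code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-task outcomes: task length in hours, 0/1 completion.
lengths_h = np.array([0.1, 0.25, 0.5, 1, 2, 4, 8, 16, 0.25, 1, 4, 8, 16])
success   = np.array([1,   1,    1,   1, 1, 1, 0, 1,  1,    1, 0, 0, 0])

# Logistic fit of success probability against log task length
# (large C so regularization is effectively off).
X = np.log(lengths_h).reshape(-1, 1)
model = LogisticRegression(C=1e6).fit(X, success)

def horizon(p):
    """Task length (hours) at which predicted success probability equals p."""
    b0, b1 = model.intercept_[0], model.coef_[0][0]
    return float(np.exp((np.log(p / (1 - p)) - b0) / b1))

print(f"50% horizon: {horizon(0.5):.1f} h, 80% horizon: {horizon(0.8):.1f} h")
```

With only a handful of long tasks, flipping a single 0/1 outcome near the top of the range moves the fitted horizons a lot, which is where the wide confidence intervals come from.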
The text continues "with current AI tools", which is not clearly defined to me (does it mean current-generation models plus scaffolding? Anything that is an LLM reasoning model? Anything built with a large LLM inside?). In any case, the title is misleading for leaving out the end of the sentence. Can we please fix the title?
I think there are two separate claims here. One is that slowness of progress in research is good because it signals high value/difficulty; with this I wholeheartedly agree. The other is that slowness in solving a given problem is good, which is less clear.
I think intelligence should indubitably be linked to speed: if you can solve everything faster, I think "smarter" is a correct label. What I also think is true is that slowness can be a virtue for a person solving problems, and as a strategy. But this is usually because fast strategies rely on priors/assumptions and ideas which generalize poorly, and more general, asymptotically faster algorithms are often slower when tested on a limited set or at a difficulty level that is too low (see the toy example below).
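A toy illustration of that last point, my example rather than anything from the article: merge sort is asymptotically faster, but plain insertion sort usually wins on tiny inputs because it carries less overhead, which is why production sorts fall back to insertion sort for short runs.

```python
import random
import timeit

def insertion_sort(xs):
    # O(n^2), but very little per-element overhead.
    xs = list(xs)
    for i in range(1, len(xs)):
        j, key = i, xs[i]
        while j > 0 and xs[j - 1] > key:
            xs[j] = xs[j - 1]
            j -= 1
        xs[j] = key
    return xs

def merge_sort(xs):
    # O(n log n), but with recursion and list-building overhead.
    if len(xs) <= 1:
        return list(xs)
    mid = len(xs) // 2
    left, right = merge_sort(xs[:mid]), merge_sort(xs[mid:])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

for n in (8, 4096):
    data = [random.random() for _ in range(n)]
    t_ins = timeit.timeit(lambda: insertion_sort(data), number=200)
    t_mrg = timeit.timeit(lambda: merge_sort(data), number=200)
    print(f"n={n}: insertion {t_ins:.4f}s, merge {t_mrg:.4f}s")
```

On the "easy" n=8 case the simpler, less general algorithm is faster; only on the harder case does the more general one pull ahead.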
I think part of the message is that speed isn't a free lunch. If an intelligence can solve "legible" problems quickly, that is symptomatic of a specific adaptation for identifying short paths.
So when you factor speed into tests, you're systematically filtering for intelligences that are biased to avoid novelty. Then if someone is slow to solve the same problems, it's actually a signal that they have the opposite bias, to consider more paths.
IMO the thing being measured by intelligence tests is something closer to "power" or "competitive advantage".
I haven't looked into the source study, so who knows if it's good, but I recall an article about smart people taking longer to answer hard problems because they take more into consideration, while being much more likely to be correct.
I turned on API billing in AI Studio in the hope of getting the best possible service. As long as you are not using the Gemini thinking and research APIs for long-running computations, the APIs are very inexpensive to use.
I think it's probably actually better at math, though still not good enough to be useful in my research in a substantial way. I suspect this will change suddenly at some point as the models move past a certain threshold. It is also heavily limited by the fact that the models are very bad at not giving wrong proofs/counterexamples: even if they produce useful rates of success, the labor of sorting through a bunch of trash makes it hard to justify.
OK, but if the verification loop really makes the agents MUCH more useful, then this usefulness difference can itself be used as a training signal to improve the agents. So the current capability levels are certainly not going to remain for very long (which is also what I expect, but I'd point out it's supported by this as well).
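A rough sketch of what "use the verification loop as a training signal" could look like; this is my interpretation (essentially rejection sampling against a verifier), not any lab's actual pipeline, and every function below is a toy stand-in:

```python
import random

def generate(task, n=8):
    """Toy 'model': propose n candidate answers to an arithmetic task."""
    a, b = task
    return [a + b + random.choice([-2, -1, 0, 0, 1]) for _ in range(n)]

def verify(task, candidate):
    """Toy verifier: the analogue of tests passing or a proof checking."""
    a, b = task
    return candidate == a + b

tasks = [(random.randint(0, 9), random.randint(0, 9)) for _ in range(100)]

# Keep only candidates the verifier accepts; this filtered set is what you
# would fine-tune on, so the verifier is supplying the labels.
training_set = [
    (task, cand)
    for task in tasks
    for cand in generate(task)
    if verify(task, cand)
]
print(f"{len(training_set)} verified examples out of {100 * 8} samples")
```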