Davidzheng's comments | Hacker News

Big error bars, and the METR people are saying the longer end of the benchmark is less accurate right now. I think they mean this is a lower bound!

It's complicated. Opus 4.5 is actually not that good at the 80% threshold but is above the others at the 50% completion threshold. I read there's a single task around 16h that the model completed, and the broad CI comes from that.

METR currently simply runs out of tasks at 10-20h, and as a result you have a small N and lots of uncertainty there. (They fit a logistic to the discrete 0/1 results to get the thresholds you see in the graph.) They need new tasks, then we'll know better.
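(For concreteness, here's a minimal sketch of that kind of fit: a logistic in log task length fitted to binary outcomes, then inverted to read off the 50% and 80% horizons. The task lengths and 0/1 results below are made-up placeholders, not METR's data.)

    # Sketch: fit a logistic curve to 0/1 task outcomes vs. task length,
    # then read off the 50% and 80% completion-time thresholds.
    # The numbers below are illustrative placeholders, not real METR data.
    import numpy as np
    from scipy.optimize import minimize

    # task lengths in hours and whether the model succeeded (1) or failed (0)
    lengths = np.array([0.1, 0.25, 0.5, 1, 2, 4, 8, 16])
    success = np.array([1,   1,    1,   1, 0, 1, 0, 0])

    x = np.log(lengths)  # fit in log-time, since success falls off roughly log-linearly

    def neg_log_likelihood(params):
        a, b = params
        p = 1.0 / (1.0 + np.exp(-(a + b * x)))
        p = np.clip(p, 1e-9, 1 - 1e-9)
        return -np.sum(success * np.log(p) + (1 - success) * np.log(1 - p))

    a, b = minimize(neg_log_likelihood, x0=[0.0, -1.0]).x

    def horizon(p_target):
        # invert p = sigmoid(a + b*log(t)) for t
        return np.exp((np.log(p_target / (1 - p_target)) - a) / b)

    print("50% horizon (h):", horizon(0.5))
    print("80% horizon (h):", horizon(0.8))

With only a couple of tasks out past ~8h, the tail of the fitted curve (and hence the 80% threshold) moves a lot if you flip a single outcome, which is exactly where the wide error bars come from.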


Thanks for this comment. I've been trying to find anything about the huge error bars. Do you have any sources you can share for further reading?

The text continues "with current AI tools", which is not clearly defined to me (does it mean the current generation plus scaffold? Anything that is an LLM reasoning model? Anything built with a large LLM inside?). In any case, the title is misleading for not including the end of the sentence. Please can we fix the title?

Also, I think the main source of interest is that it is said by Terry, so that should be in the title too.

I think there are two separate things. Slowness of progress in research is good because it signals high value/difficulty; with this I wholeheartedly agree. The other is that slowness in solving a given problem is good, which is less clear.

I think intelligence should indubitably be linked to speed. If you can do everything faster, I think smarter is a correct label. What I also think is true is that slowness can be a virtue in solving problems, both for a person and as a strategy. But this is usually because fast strategies rely on priors/assumptions and ideas which generalize poorly, and more general, asymptotically faster algorithms are often slower when tested on a limited set or at a difficulty level which is too low.


I think part of the message is that speed isn't a free lunch. If an intelligence can solve "legible" problems quickly, it's symptomatic of a specific adaptation for identifying short paths.

So when you factor speed into tests, you're systematically filtering for intelligences that are biased to avoid novelty. Then if someone is slow to solve the same problems, it's actually a signal that they have the opposite bias, to consider more paths.

IMO the thing being measured by intelligence tests is something closer to "power" or "competitive advantage".


> Then if someone is slow to solve the same problems, it's actually a signal that they have the opposite bias, to consider more paths.

No this isn't true, most of the time they just don't consider any paths at all and are just dumb.

And a bias towards novelty doesn't make you slow; ADHD is biased towards novelty, and people wouldn't call that slow.


What I meant is, assuming that they do find solutions. If they're not doing anything of course that's different.

In the article, "speed" is about reaching specific answers in a specific window of time, the bane of ADHD.


I haven’t looked into the source study, so who knows if it’s good, but I recall this article about smart people taking longer to provide answers to hard problems because they take more into consideration, but are much more likely to be correct.

https://bigthink.com/neuropsych/intelligent-people-slower-so...


On AI Studio the free-tier limits on all models are decent.

I turned on API billing on AI Studio in the hope of getting the best possible service. As long as you are not using the Gemini thinking and research APIs for long-running computations, the APIs are very inexpensive to use.

What if the lie is a logical deduction error, not a fact retrieval error?

The error rate would still be improved overall, which might make it a viable tool for the price depending on the use case.

I think it's probably actually better at math, though still not enough to be useful in my research in a substantial way. I suspect this will change suddenly at some point as the models move past a certain threshold. It's also heavily limited by the fact that the models are very bad at not giving wrong proofs/counterexamples, so even when the models succeed at useful rates, the labor of sorting through a bunch of trash makes it hard to justify.

OK, but if the verification loop really makes the agents MUCH more useful, then this usefulness difference can be used as a training signal to improve the agents themselves. So this means the current capability levels are certainly not going to remain for very long (which is also what I expect, but I would like to point out it's also supported by this).

That's a strong RL technique that could equal the quality of RLHF.
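(A toy sketch of what that reward loop could look like, to make the idea concrete: treat the verifier's pass/fail as a reward and do a REINFORCE-style update. The policy and verifier here are stand-ins I made up; in practice the policy would be the agent/LLM and the verifier a test suite, proof checker, or execution harness.)

    # Toy sketch: use a verifier's pass/fail signal as an RL reward
    # (REINFORCE on a softmax policy over a few candidate "strategies").
    import numpy as np

    rng = np.random.default_rng(0)

    NUM_ACTIONS = 4   # hypothetical candidate strategies
    CORRECT = 2       # the only one the verifier accepts (made up)

    logits = np.zeros(NUM_ACTIONS)  # policy parameters

    def sample_action(logits):
        p = np.exp(logits - logits.max())
        p /= p.sum()
        return rng.choice(NUM_ACTIONS, p=p), p

    def verifier(action):
        # Binary reward: did the verification loop accept the output?
        return 1.0 if action == CORRECT else 0.0

    lr, baseline = 0.5, 0.0
    for step in range(200):
        action, p = sample_action(logits)
        reward = verifier(action)
        baseline = 0.9 * baseline + 0.1 * reward   # running reward baseline
        advantage = reward - baseline
        grad = -p
        grad[action] += 1.0                        # d log pi(a) / d logits
        logits += lr * advantage * grad            # REINFORCE update

    print("final policy:", np.round(np.exp(logits) / np.exp(logits).sum(), 3))

The policy ends up concentrated on whatever the verifier accepts; the whole training signal comes from the verification loop, with no human preference labels involved.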

Uh, no, it's not solved by looping over 4-digit numbers when it uses tools.


Didn't the US just allow H200s to China in the last few days, btw?


Ah yes, the Kasparov approach

