
This is almost perfect.

The gold standard for LLM evaluation would have the following qualities:

1. Categorized (e.g. coding, reasoning, general knowledge)

2. Multimodal (at least text and image)

3. Multiple difficulties (something like "GPT-4 saturates or scores >90%", a la MMLU, "GPT-4 scores 20-80%", and "GPT-4 scores < 10%")

4. Hidden (under 10% of the dataset publicly available, enough methodological detail to inspire confidence but not enough to design to the test set)
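
To make the categorized / multimodal / multiple-difficulties / hidden idea concrete, here's a rough sketch of what a single item in such a benchmark could look like (Python; the field names and enums are my own illustration, not taken from any existing suite):

    from dataclasses import dataclass
    from enum import Enum

    class Category(Enum):
        CODING = "coding"
        REASONING = "reasoning"
        GENERAL_KNOWLEDGE = "general_knowledge"

    class Difficulty(Enum):
        SATURATED = "frontier models score >90%"
        MID = "frontier models score 20-80%"
        HARD = "frontier models score <10%"

    @dataclass
    class BenchmarkItem:
        prompt: str
        answer: str
        category: Category
        difficulty: Difficulty
        image_path: str | None = None  # multimodal items attach an image
        public: bool = False           # under 10% of items would ever be released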

The standard model card suite with MMLU, HumanEval, etc. has already been optimized to the point of diminishing value - Goodhart's law in action. Meanwhile, arena Elo (https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...) is extremely useful, but has the drawback of reflecting median-voter preferences that won't necessarily correlate with true intelligence as capabilities continue to advance, much as the doctor with the best bedside manner is not necessarily the best doctor.

Until that happens, I'll pay attention to every eval I can find, but am also stuck asking "how many r's are in strawberry?" and "draw a 7-sided stop sign" to get a general impression of intelligence independent of gameable or overly general benchmarks.
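
For what it's worth, that kind of personal spot check is trivial to script. A minimal sketch against the OpenAI Python SDK (v1.x); the model name and probe questions are just examples, and grading is left to eyeballing the output:

    from openai import OpenAI

    PROBES = [
        "How many r's are in 'strawberry'?",
        "Draw a 7-sided stop sign as ASCII art.",
    ]

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    for probe in PROBES:
        resp = client.chat.completions.create(
            model="gpt-4o",  # swap in whatever model you're poking at
            messages=[{"role": "user", "content": probe}],
        )
        print(f"Q: {probe}\nA: {resp.choices[0].message.content}\n")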

But all that aside:

    Model              | Score
    -------------------+------
    GPT-4o             |  52
    Llama 3.1 405B     |  50
    Claude 3.5 Sonnet  |  46
    Mistral Large      |  44
    Gemini 1.5 Pro     |  12

What an incredible contrast to MMLU, where all of these models score in the 80-90% range! For what it's worth, these scores also fall much closer to my impressions from daily use: Gemini is awful, Sonnet and 4o are amazing, and the new Llama puts a fine-tunable, open-source 4o-class model in the hands of anyone with a mini-cluster.


Very audacious to call it "almost perfect" when it appears to have only 50 questions. For comparison, MMLU spans 57 tasks and roughly 16,000 questions.


I've found Claude 3.5 Sonnet to actually be much worse on average for coding than Claude 3 Opus, at least for my use case and when accessed through Kagi. The hallucination rate is much higher.

GPT-4o hallucinates far less than Claude 3 Opus, but also seems to have less niche knowledge (I was using it to assist with a Groovy+Spock+Spring upgrade).

So I question a lot of the benchmarks published for the newest models; they don't seem to track accurately with my use cases.


> Hidden

It's pretty absurd that this can be a criterion for a good benchmark. You should dismiss any paper (e.g. benchmark result) that isn't repeatable, and by definition a closed test dataset is not repeatable. It doesn't provide much insight anyway; you might as well call it an arbitrarily curated rating, like the ones journalists churn out ("top 50 most influential women of all time").

Obviously, this contradicts the requirement that the test set shouldn't be a subset of the training set. It's reasonable to assume that if you can access data on the internet, OpenAI can too, unless everyone agrees to respect robots.txt, and even then somebody can just repost the content.

I don't have a solution. I'm just saying this is complete bullshit, the idea that we are starting to exclaim "yay, hidden data! I can trust that!" just cannot be acceptable.


One way to handle this might be to have the data hidden, but verifiable in the future. That is: publish a signed hash of the benchmark questions, and every X amount of time swap them out and publish the old ones.
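
A minimal sketch of that commit-and-reveal idea in Python, using only a SHA-256 digest (a real deployment would additionally sign the digest and pin down the serialization format; the file layout here is just an assumption):

    import hashlib
    import json

    def commit(questions: list[dict]) -> str:
        # Serialize the hidden question set deterministically, then hash it.
        blob = json.dumps(questions, sort_keys=True, ensure_ascii=False).encode("utf-8")
        return hashlib.sha256(blob).hexdigest()

    def verify(revealed_questions: list[dict], published_digest: str) -> bool:
        # Anyone can re-hash the questions once released and compare.
        return commit(revealed_questions) == published_digest

    hidden_set = [{"id": 1, "prompt": "How many r's are in 'strawberry'?", "answer": "3"}]
    digest = commit(hidden_set)  # published up front, e.g. alongside the leaderboard
    # ...later, once this set is rotated out and released, anyone can check:
    assert verify(hidden_set, digest)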

> I don't have a solution. I'm just saying this is complete bullshit, the idea that we are starting to exclaim "yay, hidden data! I can trust that!" just cannot be acceptable.

This just feels unkind. If you acknowledge that it's a hard (impossible?) problem, you should give some leeway to people doing their best until a consensus on the right approach(es) exists.

