
This is almost perfect.

The gold standard for LLM evaluation would have the following qualities:

1. Categorized (e.g. coding, reasoning, general knowledge)

2. Multimodal (at least text and image)

3. Multiple difficulties (something like "GPT-4 saturates or scores >90%", a la MMLU, "GPT-4 scores 20-80%", and "GPT-4 scores < 10%")

4. Hidden (under 10% of the dataset publicly available, enough methodological detail to inspire confidence but not enough to design to the test set)
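
To make the categorized / multimodal / multiple-difficulties / hidden idea concrete, here's a rough sketch of what a single item in such a benchmark could look like (Python; the field names and enums are my own illustration, not taken from any existing suite):

    from dataclasses import dataclass
    from enum import Enum

    class Category(Enum):
        CODING = "coding"
        REASONING = "reasoning"
        GENERAL_KNOWLEDGE = "general_knowledge"

    class Difficulty(Enum):
        SATURATED = "frontier models score >90%"
        MID = "frontier models score 20-80%"
        HARD = "frontier models score <10%"

    @dataclass
    class BenchmarkItem:
        prompt: str
        answer: str
        category: Category
        difficulty: Difficulty
        image_path: str | None = None  # multimodal items attach an image
        public: bool = False           # under 10% of items would ever be released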

The standard model card suite with MMLU, HumanEval, etc. has already been optimized to the point of diminishing value - Goodhart's law in action. Meanwhile, arena Elo (https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...) is extremely useful, but has the drawback of reflecting median-voter preferences that won't necessarily correlate with true intelligence as capabilities continue to advance, much as the doctor with the best bedside manner is not necessarily the best doctor.

Until that happens, I'll pay attention to every eval I can find, but am also stuck asking "how many r's are in strawberry?" and "draw a 7-sided stop sign" to get a general impression of intelligence independent of gameable or overly general benchmarks.
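
For what it's worth, that kind of personal spot check is trivial to script. A minimal sketch against the OpenAI Python SDK (v1.x); the model name and probe questions are just examples, and grading is left to eyeballing the output:

    from openai import OpenAI

    PROBES = [
        "How many r's are in 'strawberry'?",
        "Draw a 7-sided stop sign as ASCII art.",
    ]

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    for probe in PROBES:
        resp = client.chat.completions.create(
            model="gpt-4o",  # swap in whatever model you're poking at
            messages=[{"role": "user", "content": probe}],
        )
        print(f"Q: {probe}\nA: {resp.choices[0].message.content}\n")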

But all that aside:

    Model              | Score
    -------------------+------
    GPT-4o             |  52
    Llama 3.1 405B     |  50
    Claude 3.5 Sonnet  |  46
    Mistral Large      |  44
    Gemini 1.5 Pro     |  12

What an incredible contrast to MMLU, where all of these models score in the 80-90% range! For what it's worth, these scores also fall much closer to my impressions from daily use: Gemini is awful, Sonnet and 4o are amazing, and the new Llama puts a fine-tunable, open-source 4o-class model in the hands of anyone with a mini-cluster.


Very audacious to call it "almost perfect" when it appears to have only 50 questions. For comparison, MMLU spans 57 tasks and roughly 16,000 questions.


I've found Claude 3.5 Sonnet to actually be much worse on average for coding than Claude 3 Opus, at least for my use case and when accessed through Kagi. The hallucination rate is much higher.

GPT-4o hallucinates far less than Claude 3 Opus, but also seems to have less niche knowledge (I was using it to assist with a Groovy+Spock+Spring upgrade).

So I question a lot of the benchmarks published for the newest models; they don't seem to track accurately with my use cases.


> Hidden

It's pretty absurd that this can be a criterion for a good benchmark. You should dismiss any paper (e.g. benchmark result) that isn't repeatable, and by definition a closed test dataset is not repeatable. It doesn't provide much insight anyway; you might as well call it an arbitrarily curated rating, like the ones journalists churn out ("top 50 most influential women of all time").

Obviously, this contradicts the requirement that the test set shouldn't be a subset of the training set. It's reasonable to assume that if you can access data on the internet, OpenAI can too, unless everyone agrees to respect robots.txt, and even then somebody can just repost the content.

I don't have a solution. I'm just saying this is complete bullshit, the idea that we are starting to exclaim "yay, hidden data! I can trust that!" just cannot be acceptable.


One way to handle this might be to have the data hidden, but verifiable in the future. That is: publish a signed hash of the benchmark questions, and every X amount of time swap them out and publish the old ones.
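
A minimal sketch of that commit-and-reveal idea in Python, using only a SHA-256 digest (a real deployment would additionally sign the digest and pin down the serialization format; the file layout here is just an assumption):

    import hashlib
    import json

    def commit(questions: list[dict]) -> str:
        # Serialize the hidden question set deterministically, then hash it.
        blob = json.dumps(questions, sort_keys=True, ensure_ascii=False).encode("utf-8")
        return hashlib.sha256(blob).hexdigest()

    def verify(revealed_questions: list[dict], published_digest: str) -> bool:
        # Anyone can re-hash the questions once released and compare.
        return commit(revealed_questions) == published_digest

    hidden_set = [{"id": 1, "prompt": "How many r's are in 'strawberry'?", "answer": "3"}]
    digest = commit(hidden_set)  # published up front, e.g. alongside the leaderboard
    # ...later, once this set is rotated out and released, anyone can check:
    assert verify(hidden_set, digest)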

> I don't have a solution. I'm just saying this is complete bullshit, the idea that we are starting to exclaim "yay, hidden data! I can trust that!" just cannot be acceptable.

This just feels unkind. If you acknowledge that it's a hard (impossible?) problem, you should give some leeway to people doing their best until a consensus on the right approach(es) exists.

