
I work on LLM benchmarks and human evals for a living in a research lab (as opposed to product). I can say: it’s pretty much the Wild West and a total disaster. No one really has a good solution, and researchers are also in a huge rush and don’t want to end up making their whole job benchmarking. Even if you did, and even if you had the right background, you could do benchmarks full time and they would still be a mess.

Product testing (with traditional A/B tests) is kind of the best bet, since you can measure what you care about _directly_ and at scale.

I would say there is of course “benchmarketing”, but generally people do sincerely want to make good benchmarks; it’s just hard or impossible. For many of these problems we’re hitting capabilities where we don’t even have a decent paradigm to use.



For what it's worth, I work on platforms infra at a hyperscaler and benchmarks are a complete fucking joke in my field too lol.

Ultimately we are measuring extremely measurable things that have an objective ground truth. And yet:

- we completely fail at statistics (the MAJORITY of analysis is literally just "here's the delta in the mean of these two samples". If I ever do see people gesturing at actual proper analysis, if prompted they'll always admit "yeah, well, we do come up with a p-value or a confidence interval, but we're pretty sure the way we calculate it is bullshit"). A sketch of what a more defensible version could look like is below, after this list.

- the benchmarks are almost never predictive of the performance of real world workloads anyway

- we can obviously always just experiment in prod but then the noise levels are so high that you can entirely miss million-dollar losses. And by the time you get prod data you've already invested at best several engineer-weeks of effort.
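
(Re the first bullet: a minimal sketch, with simulated latency numbers rather than anything from a real system, of what a more defensible two-sample comparison could look like: a permutation test on the difference of means plus a bootstrap interval, instead of just eyeballing the delta.)

    # Hypothetical comparison of two latency samples (ms) from benchmark runs
    # of build A vs build B; the numbers are simulated for illustration.
    import numpy as np

    rng = np.random.default_rng(0)
    a = rng.normal(100, 15, size=200)   # baseline latencies
    b = rng.normal(97, 15, size=200)    # candidate latencies
    observed = a.mean() - b.mean()

    # Permutation test: how often does a random relabelling of the pooled
    # data produce a delta at least as extreme as the observed one?
    pooled = np.concatenate([a, b])
    n_perm, hits = 10_000, 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        delta = pooled[:len(a)].mean() - pooled[len(a):].mean()
        hits += abs(delta) >= abs(observed)
    p_value = (hits + 1) / (n_perm + 1)

    # Bootstrap 95% CI on the difference of means, so the report is an
    # interval rather than a bare point estimate.
    boot = [rng.choice(a, len(a)).mean() - rng.choice(b, len(b)).mean()
            for _ in range(5_000)]
    lo, hi = np.percentile(boot, [2.5, 97.5])
    print(f"delta={observed:.2f} ms, p={p_value:.4f}, 95% CI=({lo:.2f}, {hi:.2f})")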

AND this is a field where the economic incentives for accurate predictions are enormous.

In AI, you are measuring weird and fuzzy stuff, and you kinda have an incentive to just measure some noise that looks good for your stock price anyway. AND then there's contamination.

Looking at it this way, it would be very surprising if the world of LLM benchmarks was anything but a complete and utter shitshow!


> we completely fail at statistics (the MAJORITY of analysis is literally just "here's the delta in the mean of these two samples". If I ever do see people gesturing at actual proper analysis, if prompted they'll always admit "yeah, well, we do come up with a p-value or a confidence interval, but we're pretty sure the way we calculate it is bullshit")

Sort of tangential, but as someone currently taking an intro statistics course and wondering why it's all not really clicking given how easy the material is, this for some reason makes me feel a lot better.


FWIW, I don't think intro stats is easy the way I normally see it taught. It focuses on formulae, tests, and step-by-step recipes without spending the time to properly develop intuition as to why those work, how they work, which ones you should use in unfamiliar scenarios, how you might find the right thing to do in unfamiliar scenarios, etc.

Pair that with skipping all the important problems (what is randomness, how do you formulate the right questions, how do you set up an experiment capable of collecting data which can actually answer those questions, etc), and it's a recipe for disaster.

It's just an exercise in box-ticking, and some students get lucky with an exceptional teacher, and others are independently able to develop the right instincts when they enter the class with the right background, but it's a disservice to almost everyone else.


I found the same when I was taking intro to stats - I did get a much better intuition for what stuff meant after reading 'Superforecasting' by Tetlock and Gardner - I find I'm recommending that book a lot, come to think of it.


“Here’s the throughput at sustained 100% load with the same ten sample queries repeated over and over.”

“The customers want lower latency at 30% load for unique queries.”

“Err… we can scale up for more throughput!”

ಠ_ಠ


And then when you ask if they disabled the query result cache before running their benchmarking, they blink and look confused.


Then you see a 25% cache hit rate in production and realise that disabling it for the benchmark is not a good option either.


In AI though, you also have the world trying to compete with you. Even if you totally cheat, put the benchmark answers in your training set and overfit, it doesn't matter how much your marketing department tells everyone you scored 110% on SWE-bench: if the model doesn't work out that well in production, your announcement is going to flop as users discover it doesn't work that well on their personal/internal secret benchmarks and tell /r/localLLAMA it isn't worth the download.

Whatever happened with Llama 4?


Even a p-value is insufficient. Maybe we can use some of this stuff: https://web.stanford.edu/~swager/causal_inf_book.pdf


I have actually been thinking of hiring some training contractors to come in and teach people the basics of applied statistical inference. I think with a bit of internal selling, engineers would generally be interested enough to show up and pay attention. And I don't think we need very deep expertise, just a moderate bump in the ambient level of statistical awareness would probably go a long way.

It's not like there's a shortage of skills in this area; it seems like our one specific industry just has a weird blind spot.


Don’t most computer science programs require this? Mine had a statistics requirement


I don't know how it is in the US and other countries, but in my country I would say statistics is typically not taught well, at least in CS degrees. I was a very good student, always had a good understanding of the subjects at university, but in the case of statistics they just taught us formulae and techniques as dogmas without much explanation of where they came from, why, and when to use them. It didn't help either that the exercises we did always applied them to things outside CS (clinical testing, people's heights and things like that) with no application we could directly relate to. As a result, when I finished the degree I had forgotten most of it, and when I started working I was surprised that it was actually useful.

When I talk about this with other CS people in my own country (Spain) they tend to report similar experiences.


I had the same experience in the US


I'd say your experience is more about being monetized for growth for growth's sake.


Actually I disagree that that's what's going on in the world of hyperscaler platforms. There is genuinely a staggering amount of money on the line with the efficiency of this platform. Plus, we have extremely sophisticated and performance-sensitive customers who are directly and continuously comparing us with our competitors.

This isn't just that nobody cares about the truth. People 100% care! If you actually degrade a performance metric as measured post-hoc in full prod, someone will 100% notice, and if you want to keep your feature un-rolled-back, you are probably gonna have to have a meeting with someone that has thousands of reports, and persuade them it's worth it to the business.

But you're always gonna have more luck if you can have that meeting _before_ you degrade it. But... it's usually pretty hard to figure out what the exact degradation is gonna be, because of the things in my previous comment...


A/B testing is radioactive too. It's indirectly optimizing for user feedback - less stupid than directly optimizing for user feedback, but still quite dangerous.

Human raters are exploitable, and you never know whether the B has a genuine performance advantage over A, or just found a meat exploit by accident.

It's what fucked OpenAI over with 4o, and fucked over many other labs in more subtle ways.


Are you talking about just preferences, or A/B tests on like retention and engagement? The latter I think is pretty reliable and powerful, though I have never personally done them. Preferences are just as big a mess: WHO the annotators are matters, and if you are using preferences as a proxy for, like, correctness, you’re not really measuring correctness, you’re measuring e.g. persuasion. A lot of construct validity challenges (which themselves are hard to even measure in domain).
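
(For concreteness, a rough sketch of the kind of readout a retention A/B test gives you, with invented counts; it's just a two-proportion z-test plus an interval on the lift, nothing exotic.)

    # Hypothetical day-7 retention A/B readout; all counts are invented.
    import math

    a_ret, a_n = 4120, 10000   # retained / exposed in arm A
    b_ret, b_n = 4310, 10000   # retained / exposed in arm B

    p_a, p_b = a_ret / a_n, b_ret / b_n
    lift = p_b - p_a

    # Pooled two-proportion z-test
    p_pool = (a_ret + b_ret) / (a_n + b_n)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / a_n + 1 / b_n))
    z = lift / se_pool
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))  # two-sided

    # Wald 95% CI on the retention lift
    se = math.sqrt(p_a * (1 - p_a) / a_n + p_b * (1 - p_b) / b_n)
    ci_low, ci_high = lift - 1.96 * se, lift + 1.96 * se
    print(f"lift={lift:.4f}, z={z:.2f}, p={p_value:.4f}, 95% CI=({ci_low:.4f}, {ci_high:.4f})")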


Yes. All of them are poisoned metrics, just in different ways.

GPT-4o's endless sycophancy was great for retention, GPT-5's style of ending every response in a question is great for engagement.

Are those desirable traits though? Doubt it. They look like simple tricks and reek of reward hacking - and A/B testing rewards them indeed. Direct optimization is even worse. Combining the two is ruinous.

Mind, I'm not saying that those metrics are useless. Radioactive materials aren't useless. You just got to keep their unpleasant properties in mind at all times - or suffer the consequences.


The big problem is that tech companies and journalists aren't transparent about this. They tout benchmark numbers constantly, like they're an objective measure of capabilities.


HN members do too. Look at my comment history.

The general populace doesn't care to question how benchmarks are formulated and what their known (and unknown) limitations are.

That being said, they are likely decent proxies. For example, I think the average user isn't going to observe a noticeable difference between Claude Sonnet and OpenAI Codex.


That's because they are as close to an "objective measure of capabilities" as anything we're ever going to get.

Without benchmarks, you're down to evaluating model performance based on vibes and vibes only, which plain sucks. With benchmarks, you have numbers that correlate to capabilities somewhat.


That's assuming these benchmarks are the best we're ever going to get, which they clearly aren't. There's a lot to improve even without radical changes to how things are done.


The assumption I make is that "better benchmarks" are going to be 5% better, not 5000% better. LLMs are gaining capabilities faster than the benchmarks are getting better at measuring them accurately.

So, yes, we just aren't going to get anything that's radically better. Just more of the same, and some benchmarks that are less bad. Which is still good. But don't expect a Benchmark Revolution when everyone suddenly realizes just how Abjectly Terrible the current benchmarks are, and gets New Much Better Benchmarks to replace them with. The advances are going to be incremental, unimpressive, and meaningful only in aggregate.


So because there isn't a better measure it's okay that tech companies effectively lie and treat these benchmarks like they mean more than they actually do?


Sorry, pal, but if benchmarks were to disagree with opinions of a bunch of users saying "tech companies bad"? I'd side with benchmarks at least 9 times out of 10.


How does that have anything to do with what we're talking about?


What that has to do is: your "tech companies are bad for using literally the best tool we have for measuring AI capabilities when talking about AI capabilities" take is a very bad take.

It's like you wanted to say "tech companies are bad", and the rest is just window dressing.


In my experience everyone openly talks about how benchmarks are bullshit. On Twitter or on their podcast interviews or whatever everyone knows benchmarks are a problem. It's never praise.

Of course they tout benchmark numbers because, let's be real, if they didn't tout benchmarks you're not going to bother using the model. For example, if someone posts some random model on huggingface with no benchmarks, you just won't proceed.

Humans have a really strong prior to not waste time. We always always evaluate things hierarchically. We always start with some prior and then whatever is easiest goes next, even if it's a shitty unreliable measure.

For example, for Gemini 3 everyone will start with a prior that it is going to be good. Then they will look at benchmarks, and only then will they move to harder evaluations on their own use cases.


I don't use them regardless of the benchmarks, but I take your point.

Regardless though, I think the marketing could be more transparent


> Brittle performance – A model might do well on short, primary school-style maths questions, but if you change the numbers or wording slightly, it suddenly fails. This shows it may be memorising patterns rather than truly understanding the problem

This finding really shocked me
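
(Out of curiosity, here's a toy sketch of how you could probe that yourself: hold the wording of a templated question fixed, vary the numbers, and see whether accuracy survives. The template and the ask_model stub below are placeholders I made up, not anything from the article.)

    # Toy robustness check for the "change the numbers and it fails" effect.
    # ask_model returns a canned answer so the script runs end to end;
    # swap in a real inference call to use it.
    import random

    def make_question(a: int, b: int) -> tuple[str, int]:
        return f"Tom has {a} apples and buys {b} more. How many apples does he have?", a + b

    def ask_model(question: str) -> str:
        return "42"  # placeholder for a real model call

    random.seed(0)
    trials, correct = 50, 0
    for _ in range(trials):
        a, b = random.randint(3, 97), random.randint(3, 97)
        question, answer = make_question(a, b)
        correct += ask_model(question).strip() == str(answer)
    print(f"accuracy over perturbed variants: {correct / trials:.0%}")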


Has your lab tried using any of the newer causal inference–style evaluation methods? Things like interventional or counterfactual benchmarking, or causal graphs to tease apart real reasoning gains from data or scale effects. Wondering if that’s something you’ve looked into yet, or if it’s still too experimental for practical benchmarking work.


I also work in LLM evaluation. My cynical take is that nobody is really using LLMs for stuff, and so benchmarks are mostly just made-up tasks (coding is probably the exception). If we had real specific use cases it would be easier to benchmark and know if one model is better, but it’s mostly all hypothetical.

The more generous take is that you can’t benchmark advanced intelligence very well, whether LLM or person. We don’t have good procedures for assessing a person's fitness for a purpose, e.g. for a job, certainly not standardized question sets. Why would we expect to be able to do this with AI?

I think both of these takes are present to some extent in reality.


Do you not have massive volumes of customer queries to extract patterns for what people are actually doing?

We struggle a bit with processing and extracting this kind of insight in a privacy-friendly way, but there’s certainly a lot of data.


We have 20+ services in prod that use LLMs. So I have 50k (or more) data points per service per day to evaluate. The question is: do people actually evaluate properly?

And how do you do an apples to apples evaluation of such squishy services?


You could have the world expert debate the thing. Someone who can be accused of knowing things. We have many such humans, at least as many as topics.

Publish the debate as-is so that others vaguely familiar with the topic can also be in awe or disgusted.

We have many gradients of emotion. No need to try quantify them. Just repeat the exercise.


Terminal Bench 2.0 just dropped, and a big success factor they stress is the hand-crafted, PhD-level rollout tests. They picked approximately 80 out of 120, with the incentive that anyone who contributed 3 would get listed as a paper author. This resulted in high-quality participation, equivalent to foundation labs' proprietary agentic RL data, but it's FOSS.


What gets measured, gets managed and improved, though.



