A/B testing is radioactive too. It's indirectly optimizing for user feedback - less stupid than directly optimizing for user feedback, but still quite dangerous.
Human raters are exploitable, and you never know whether B has a genuine performance advantage over A or just stumbled onto a meat exploit by accident.
It's what fucked OpenAI over with 4o, and fucked over many other labs in more subtle ways.
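Here's a toy simulation of the problem (every number invented): B's entire edge comes from flattering the rater, yet a standard significance test flags it as a clean win. The test tells you B wins; it can't tell you why.

```python
import math
import random

random.seed(0)

N = 2000  # hypothetical number of pairwise A-vs-B comparisons

def rater_prefers_b() -> bool:
    # Simulated human rater: on substance, A and B are exactly tied,
    # but B's flattery buys it an extra 5 points of win probability.
    return random.random() < 0.55

wins_b = sum(rater_prefers_b() for _ in range(N))
win_rate = wins_b / N

# Normal-approximation z-test against the null "A and B are tied".
z = (win_rate - 0.5) / math.sqrt(0.25 / N)
print(f"B win rate: {win_rate:.3f}, z = {z:.2f}")
# z comes out around 4-5: a textbook "significant" win for B. Nothing in
# this number distinguishes a genuine quality gain from a meat exploit.
```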
Are you talking about just preferences, or A/B tests on things like retention and engagement? The latter I think are pretty reliable and powerful, though I've never personally run them. Preferences are a big mess: WHO the annotators are matters, and if you're using preferences as a proxy for, say, correctness, you're not really measuring correctness - you're measuring e.g. persuasion. A lot of construct validity challenges (which are themselves hard to even measure in-domain).
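A quick toy sketch of the annotator-pool point, with made-up parameters: the same pair of models gets opposite verdicts from an expert pool and a crowd pool, because the crowd votes on persuasiveness rather than correctness.

```python
import random

random.seed(1)

# Invented ground truth: A is more often correct, B is more persuasive.
P_A_CORRECT, P_B_CORRECT = 0.90, 0.50

def vote(expertise: float) -> str:
    # One pairwise judgment. With probability `expertise` the rater actually
    # checks correctness; otherwise (or when both answers check out the
    # same) they vote on style, which favors the more persuasive B.
    if random.random() < expertise:
        a_ok = random.random() < P_A_CORRECT
        b_ok = random.random() < P_B_CORRECT
        if a_ok != b_ok:
            return "A" if a_ok else "B"
    return "B" if random.random() < 0.65 else "A"

for pool, expertise in [("expert pool", 0.9), ("crowd pool", 0.1)]:
    votes = [vote(expertise) for _ in range(5000)]
    print(f"{pool}: B wins {votes.count('B') / len(votes):.1%} of comparisons")
# Expert pool: A wins ~60%. Crowd pool: B wins ~62%. Same models, opposite
# verdicts - the "preference" metric is really measuring the annotators.
```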
Yes. All of them are poisoned metrics, just in different ways.
GPT-4o's endless sycophancy was great for retention; GPT-5's habit of ending every response with a question is great for engagement.
Are those desirable traits though? Doubt it. They look like simple tricks and reek of reward hacking - and A/B testing does indeed reward them. Direct optimization is even worse. Combining the two is ruinous.
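A toy example of how cheap the engagement trick is (numbers invented again): append a question to every response and the follow-up rate jumps, with answer quality held constant.

```python
import random

random.seed(2)

N = 10_000  # hypothetical sessions per arm

def follow_up(ends_in_question: bool) -> bool:
    # Does the user send another message? Same base rate for both arms;
    # being directly asked something adds a bump. Quality is held constant.
    p = 0.30 + (0.12 if ends_in_question else 0.0)
    return random.random() < p

rate_a = sum(follow_up(False) for _ in range(N)) / N  # plain answers
rate_b = sum(follow_up(True) for _ in range(N)) / N   # answers + question
print(f"follow-up rate: A = {rate_a:.1%}, B = {rate_b:.1%}")
# B "wins" the engagement A/B by several points while producing identical
# answers. Ship whatever this metric rewards and you ship the trick.
```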
Mind, I'm not saying those metrics are useless. Radioactive materials aren't useless. You've just got to keep their unpleasant properties in mind at all times - or suffer the consequences.