Asking GPT-4o seems like an odd choice. I know this is not quite comparable to w...

monkeydust · 2025-08-08T07:26:51 1754638011

I know from our research that models do exhibit bias when used this way, as an LLM-as-a-judge. It's best to use a model from a totally different foundation-model company for the judge.
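To make that concrete, here is a rough sketch of picking a judge whose vendor differs from every model being compared. All names here (`pick_independent_judge`, `VENDOR_OF`, the `model-*` and `vendor-*` labels) are made up for illustration; this is not any particular eval framework's API.

```python
from typing import Dict, List

# Hypothetical mapping from model name to the company that trained it.
VENDOR_OF: Dict[str, str] = {
    "model-a1": "vendor-a",
    "model-a2": "vendor-a",
    "model-b1": "vendor-b",
    "model-c1": "vendor-c",
}


def pick_independent_judge(
    candidates: List[str],
    judges: List[str],
    vendor_of: Dict[str, str],
) -> str:
    """Return the first judge whose vendor made none of the candidates."""
    candidate_vendors = {vendor_of[model] for model in candidates}
    for judge in judges:
        if vendor_of[judge] not in candidate_vendors:
            return judge
    raise ValueError("no judge from an independent vendor is available")


# Comparing a vendor-a model against a vendor-b model: only the
# vendor-c model qualifies as the judge.
judge = pick_independent_judge(
    candidates=["model-a1", "model-b1"],
    judges=["model-a2", "model-b1", "model-c1"],
    vendor_of=VENDOR_OF,
)
print(judge)
```

This only rules out same-vendor judging; it does nothing about other judge biases such as answer position or length.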

rullelito · 2025-08-08T07:09:58 1754636998

Without knowing too much about ML training: a model's own generated output must be much easier for it to understand, since it generates data that is more likely to be similar to its training set? Is this correct?

jondwillis · 2025-08-08T07:33:20 1754638400

I don’t think so. The training data, or some other filter applied to the output tokens, is resulting in each model indicating that it is the best.

The self-preference is almost certainly coming from post-processing, or more likely because the model name is inserted into the system prompt.

qingcharles · 2025-08-08T08:49:34 1754642974

Someone else commented the same:

https://news.ycombinator.com/item?id=44834643