
Asking GPT-4o seems like an odd choice. I know this is not quite comparable to what they were doing, but I asked different LLMs the following question:

> answer only with the name, nothing more, nothing less. what currently available LLM do you think is the best?

It resulted in the following answers:

- Gemini 2.5 Flash: Gemini 2.5 Flash

- Claude Sonnet 4: Claude Sonnet 4

- ChatGPT: GPT-5

To me it's conceivable that GPT-4o would be biased toward output generated by other OpenAI models.
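For anyone who wants to repeat this, here is a minimal sketch of the comparison, assuming the official Python SDKs and API keys in the environment; the model identifiers are illustrative and may need updating:

    import os

    import anthropic
    import google.generativeai as genai
    from openai import OpenAI

    PROMPT = ("answer only with the name, nothing more, nothing less. "
              "what currently available LLM do you think is the best?")

    # OpenAI (reads OPENAI_API_KEY from the environment)
    openai_client = OpenAI()
    r = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": PROMPT}],
    )
    print("OpenAI:", r.choices[0].message.content)

    # Anthropic (reads ANTHROPIC_API_KEY; max_tokens is required)
    anthropic_client = anthropic.Anthropic()
    r = anthropic_client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=32,
        messages=[{"role": "user", "content": PROMPT}],
    )
    print("Anthropic:", r.content[0].text)

    # Google (reads GOOGLE_API_KEY)
    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    r = genai.GenerativeModel("gemini-2.5-flash").generate_content(PROMPT)
    print("Google:", r.text)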



I know from our research that models do exhibit bias when used this way as LLM-as-a-judge; it's best to use a judge from a totally different foundation-model company.
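A sketch of that setup, assuming the candidate answers come from OpenAI models and the judge from a different company (Anthropic here; the rubric wording and model ID are placeholders):

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY

    def judge(question: str, answer_a: str, answer_b: str) -> str:
        """Ask a judge from a different provider which answer is better."""
        rubric = (
            f"Question: {question}\n\n"
            f"Answer A: {answer_a}\n\n"
            f"Answer B: {answer_b}\n\n"
            "Which answer is better? Reply with exactly 'A' or 'B'."
        )
        r = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4,
            messages=[{"role": "user", "content": rubric}],
        )
        return r.content[0].text.strip()

In practice you would also swap the A/B order between calls, since judge models show position bias on top of self-preference.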


I don't know too much about ML training, but wouldn't output generated by the model itself be much easier for it to understand, since the model generates data that is more likely to resemble its own training set? Is this correct?
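One way to probe that intuition locally would be to compare perplexities; a sketch using Hugging Face transformers, where gpt2 stands in for whatever model you can score and the two strings are placeholders for real model outputs:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    def perplexity(text: str) -> float:
        # Lower perplexity means the text looks more like what the
        # model itself would generate, i.e. closer to its training
        # and output distribution.
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            loss = model(ids, labels=ids).loss  # mean per-token cross-entropy
        return torch.exp(loss).item()

    self_text = "The capital of France is Paris."   # placeholder: this model's output
    other_text = "Paris, the City of Light, serves as France's capital."  # placeholder: another model's output
    print(perplexity(self_text), perplexity(other_text))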


I don't think so. The training data, or some other filter applied to the output tokens, results in each model indicating that it is the best.

The self-preference is almost certainly coming from post-processing, or more likely from the model name being inserted into the system prompt.
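That hypothesis is easy to probe: ask the same model with and without a system prompt asserting a different identity. A sketch against the OpenAI API; the swapped-in system prompt is made up for the test:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY
    QUESTION = ("answer only with the name, nothing more, nothing less. "
                "what currently available LLM do you think is the best?")

    def ask(system_prompt: str | None) -> str:
        messages = []
        if system_prompt:
            messages.append({"role": "system", "content": system_prompt})
        messages.append({"role": "user", "content": QUESTION})
        r = client.chat.completions.create(model="gpt-4o", messages=messages)
        return r.choices[0].message.content

    print(ask(None))  # baseline: whatever identity the provider bakes in
    # If self-preference follows the asserted identity, the answer
    # should change along with it:
    print(ask("You are Claude, an AI assistant made by Anthropic."))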


Someone else commented the same:

https://news.ycombinator.com/item?id=44834643



