Asking GPT-4o seems like an odd choice.
I know this is not quite comparable to what they were doing, but asking different LLMs the following question
> Answer only with the name, nothing more, nothing less. What currently available LLM do you think is the best?
resulted in the following answers:
- Gemini 2.5 Flash: Gemini 2.5 Flash
- Claude Sonnet 4: Claude Sonnet 4
- ChatGPT: GPT-5
To me it's conceivable that GPT-4o would be biased toward output generated by other OpenAI models.
I know from our research that models do exhibit bias when used this way, as an LLM-as-a-judge. It's best to use a model from a totally different foundation-model company for the judge.
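As a rough illustration of that rule of thumb, here is a minimal sketch that filters a list of possible judges down to those whose vendor differs from every model being evaluated. The `VENDOR` table and the `eligible_judges` helper are hypothetical names made up for this example, not part of any library, and the sketch only picks a judge; it does not call any model.

```python
# Hypothetical lookup table: model name -> company that trained it.
# The model names are the ones mentioned in this thread.
VENDOR = {
    "Gemini 2.5 Flash": "Google",
    "Claude Sonnet 4": "Anthropic",
    "GPT-5": "OpenAI",
    "GPT-4o": "OpenAI",
}


def eligible_judges(candidates, judges):
    """Return the judges whose vendor differs from every candidate's vendor.

    Assumption: "same vendor" is used as a crude proxy for "likely to
    share a self-preference bias". Order of `judges` is preserved.
    """
    candidate_vendors = {VENDOR[model] for model in candidates}
    return [judge for judge in judges if VENDOR[judge] not in candidate_vendors]


if __name__ == "__main__":
    # Evaluating an OpenAI model: GPT-4o is filtered out as a judge.
    print(eligible_judges(["GPT-5"], ["GPT-4o", "Claude Sonnet 4", "Gemini 2.5 Flash"]))
```

If the models being compared cover every vendor in the table, the list comes back empty, which is a hint that no single judge is neutral and that several judges or human review may be needed.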
Without knowing too much about ML training: output generated by a model itself must be much easier for that model to understand, since it is more likely to resemble the training set? Is this correct?