Hacker News

I've benchmarked many of these newest AI models on private problems that require only insight, not clever techniques, ever since the first reasoning preview (o1?) came out a year ago.

The common theme I've seen is that the AI will just throw "clever tricks" at the problem and call it a day.

For example, a classic game theory setting involving xor is Nim. Give the model a game theory problem that involves xor but has nothing to do with Nim, and it will throw a bunch of "clever" Nim tricks at it, tricks that are "well known" in the literature but don't remotely apply, and then invent a headcanon for why its answer is correct.
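For reference, the standard Nim fact the models keep overapplying is Bouton's theorem: in normal-play Nim, the player to move wins iff the xor ("nim-sum") of the pile sizes is nonzero. A minimal sketch (function name is illustrative, not from any particular library):

```python
from functools import reduce
from operator import xor

def nim_first_player_wins(piles):
    """Bouton's theorem for normal-play Nim: the player to move
    has a winning strategy iff the nim-sum (xor of all pile
    sizes) is nonzero."""
    return reduce(xor, piles, 0) != 0

# Piles (3, 4, 5): nim-sum is 3 ^ 4 ^ 5 = 2, so the first player wins.
print(nim_first_player_wins([3, 4, 5]))   # True
# Piles (1, 4, 5): nim-sum is 1 ^ 4 ^ 5 = 0, a losing position to move from.
print(nim_first_player_wins([1, 4, 5]))   # False
```

The point of the comment is that this trick is only valid for Nim-like impartial games; pattern-matching on "xor" alone and reaching for it is exactly the failure mode described above.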

It seems like AI has maybe the actual reasoning of a 5th grader, but the knowledge of a PhD student. A toddler with a large hammer.

Also, keep in mind that it's not stated whether GPT-5 had access to Python, Google, etc. while doing these benchmarks, which certainly makes things easier. Many of these problems are gated by the fact that a human has only ~12 minutes to solve each one, while the AI can explore many candidate solutions at once.

No matter what benchmarks it passes, even the IMO (and I've been in the maths community for a long time), I will maintain my position: none of these benchmarks matter to me until the model can actually replace my workflow and creative insights. Trust your own eyes and experience, not whatever marketing hype is out there.


