Hacker News

I was just skimming that review while watching the live stream. Right as they mentioned how much better GPT-5 is at writing prose, I came across:

> It’s actually worse at writing than GPT-4.5

Sounds like we need to wait a bit for the dust to settle before we can trust anything we hear/read :)



This is why you need to have your own set of personal benchmarks. I have a few short stories that I have models continue (ones I wrote ages ago in my youth) or refactor. Some models are fantastic at writing but miss key details or enmesh them (Claude). Some are terrible writers at higher reasoning (o3). Some are decent writers but tend to provide very short outputs (gpt-4o). For my personal benchmarks Gemini 2.5 Pro has always generated the most compelling writing that _also_ sticks to the world/script -- and sometimes surprises me by having characters react in ways that I hadn't considered but are consistent with their "worldview" as presented by the context (usually a world guide).

I find coding to be harder to benchmark because there are so many ways to write the same solution. A "correct" solution may be terrible in another context due to loss of speed, security, etc.
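One common workaround for the "many ways to write the same solution" problem is to score generated code by behaviour rather than by text. Here's a minimal sketch (the candidate functions and test cases are made up for illustration): both candidates differ textually but pass the same checks, so a behavioural benchmark treats them as equally correct. As noted above, this still misses the speed and security dimensions.

```python
# Two textually different but behaviourally identical "model outputs".

def candidate_a(xs):
    return sorted(xs)

def candidate_b(xs):
    out = list(xs)
    out.sort()
    return out

def behaves_correctly(fn):
    # Score by input/output behaviour, not by comparing source text.
    cases = [([3, 1, 2], [1, 2, 3]), ([], []), ([5], [5])]
    return all(fn(given) == want for given, want in cases)

print(behaves_correctly(candidate_a), behaves_correctly(candidate_b))  # prints: True True
```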


> I find coding to be harder to benchmark because there are so many ways to write the same solution. A "correct" solution may be terrible in another context due to loss of speed, security, etc.

Yeah, I also do my own benchmarks, but as you said, coding ones are a bit harder. Currently I'm mostly benchmarking the accuracy of the tools I've written, which do bite-sized work: one tool edits parts of files, another rewrites files fully, and so on, and each is individually benchmarked. But they're very specific; I don't do an overall benchmark for "Change feature X to do Y" spanning a whole "session". Like you, I haven't found any good way of evaluating those results :)
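A per-tool benchmark like that can be a very small harness. This is just a sketch: `edit_file` is a hypothetical stand-in for the model-backed tool (here it does a trivial string replace so the example runs), and each case pairs an input, an instruction, and the expected output compared exactly.

```python
# Hypothetical wrapper around the model-backed editing tool.
# The real version would send `source` and `instruction` to the model;
# this placeholder does a trivial replace so the sketch is runnable.
def edit_file(source: str, instruction: str) -> str:
    if "US spelling" in instruction:
        return source.replace("colour", "color")
    return source

CASES = [
    {
        "source": "favourite colour: blue\n",
        "instruction": "Convert to US spelling",
        "expected": "favourite color: blue\n",
    },
]

def run_benchmark(cases):
    # Exact-match scoring: works for bite-sized deterministic edits,
    # but not for open-ended "change feature X to do Y" sessions.
    passed = sum(
        1 for case in cases
        if edit_file(case["source"], case["instruction"]) == case["expected"]
    )
    return passed, len(cases)

passed, total = run_benchmark(CASES)
print(f"{passed}/{total} cases passed")
```

Exact-match comparison is what makes the bite-sized tools easy to score and the session-level changes hard: once the expected output stops being unique, you're back to judging results by hand.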


Here’s one of my favourite questions. Every single model prior to this one, including 4o and o3, used to fumble it. https://photos.app.goo.gl/zzfKKqDYFtMAP9Vb8

I have yet to test this out end to end.


Well, it's difficult to trust the people selling it in the first place. They're too biased not to lie.

It's hard to make a man understand something that stands between him and his salary.


I don’t think that’s a valid excuse. Yes marketing speak has always existed, but it’s not like companies have always been completely unreliable.

I found it strange that, even though my excitement for an event like this is roughly equivalent to WWDC's these days, I had zero desire to watch the live stream, for exactly this reason: it's not like they're going to give it to us straight.

Even for this year's WWDC I at least skipped through the video afterwards; before, I used to have watch parties. Yes, they're overly positive and paint everything in a good light, but they never felt… idk, whatever the vibe is that I get from these (applicable to OpenAI, Grok, Meta, etc.).

It’s been just a few years for a revolutionary technology, and already the livestreams are less appealing than the biggest corporations' yearly events. Personally I find that sad.


It has been pointed out, but GPT-5 is really five different models with wildly different capabilities under the hood. Which model gets picked for the task at hand is, for now, not deterministic.


Better than 4o but worse than 4.5 is internally consistent. And of course writing is extremely multidimensional.


But that’s not what the review says:

“It’s actually worse at writing than GPT-4.5, and I think even 4o”

So the review is not consistent with the PR, hence the commenter expressing preference for outside sources.



