It is, but partly because it is a common form in the training data. LLM output seems to use the form more than people do, presumably either due to some bias in the training data (or the way it is tokenised) or due to other common token sequences leading into it (remember: it isn't an official acronym, but Glorified Predictive Text is an accurate description). While it is a smell, it certainly isn't a reliable marker; there needs to be more evidence than that.
It must be, but any given article is unlikely to be the average of the training material, and thus carries a different expected frequency of such a construction.
Is it just me, or did it still get at least three component placements (RAM and PCIe slots, plus that's DisplayPort, not HDMI) in the motherboard image[0] completely wrong? Why would they use that as a promotional image?
Yep, the point we wanted to make here is that GPT-5.2's vision is better, not perfect. Cherrypicking a perfect output would actually mislead readers, and that wasn't our intent.
That would be a laudable goal, but I feel like it's contradicted by the text:
> Even on a low-quality image, GPT‑5.2 identifies the main regions and places boxes that roughly match the true locations of each component
I would not consider it to have "identified the main regions" or to have "roughly matched the true locations" when ~1/3 of the boxes have incorrect labels. The remark "even on a low-quality image" is not helping either.
Edit: credit where credit is due, the recently-added disclaimer is nice:
> Both models make clear mistakes, but GPT‑5.2 shows better comprehension of the image.
Yeah, what it's calling RAM slots is the CMOS battery. What it's calling the PCIe slot is the interior side of the DB-9 connector. RAM slots and PCIe slots are not even visible in the image.
It just overlaid a typical ATX pattern across the motherboard-like parts of the image, even if that's not really what the image is showing. I don't think it's worthwhile to consider this a 'local recognition failure', as if it just happened to mistake the CMOS battery for RAM slots.
Imagine it as a markdown response:
# Why this is an ATX layout motherboard (Honest assessment, straight to the point, *NO* hallucinations)
1. *RAM*: as you can clearly see, the RAM slots are to the right of the CPU, so it's obviously ATX
2. *PCIe*: the clearly visible PCIe slots are right there at the bottom of the image, so this definitely cannot be anything except an ATX motherboard
3. ...etc., more claims supported only by force of preconception
--
It's just meta signaling gone off the rails. Something in their post-training pipeline is obviously vulnerable given how absolutely saturated with it their model outputs are.
Troubling that the behavior generalizes to image labeling, but not particularly surprising. This has been a visible problem at least since o1, and the lack of change tells me they do not have a real solution.
Eh, I'm no shill, but their marketing copy isn't exactly the New York Times. They're given some license to respond to critical feedback in a way that makes the statements more accurate, without the same expectations of objective journalism of record.
Look, just give the Qwen3-vl models a go. I've found them to be fantastic at this kind of thing so far, and what I'm seeing on display here is laughable in comparison. A closed-source, closed-weight paid model with worse performance than open ones? C'mon. OpenAI really is a bubble.
I think you may have inadvertently misled readers in a different way. I feel misled after not catching the errors myself, assuming it was broadly correct, and then coming across this observation here. It might be worth mentioning that this is better but still inaccurate. Just a bit of feedback; I appreciate that you're willing to show non-cherry-picked examples and are engaging with this question here.
Edit: As mentioned by @tedsanders below, the post was edited to include clarifying language such as: “Both models make clear mistakes, but GPT‑5.2 shows better comprehension of the image.”
Thanks for the feedback - I agree our text doesn't make the models' mistakes clear enough. I'll make some small edits now, though it might take a few minutes to appear.
When I saw that it labeled DP ports as HDMI, I immediately decided that I am not going to touch this until it is at least 5x better, with 95% accuracy on basic things.
That's far more dangerous territory. A machine that is obviously broken will not get used. A machine that is subtly broken will propagate errors, because it will have achieved a high enough trust level that it will actually get used.
Think Therac-25: it worked 99.5% of the time. In fact it worked so well that reports of malfunctions were routinely discarded.
There was a low-level Google internal service that worked so well that other teams took a hard dependency on it (against advice). So the internal team added a cron job to drop it every once in a while to get people to trust it less :-)
Oh and you guys don't mislead people ever. Your management is just completely trustworthy, and I'm sure all you guys are too. Give me a break, man. If I were you, I would jump ship or you're going to be like a Theranos employee on LinkedIn.
I disagree. I think the whole organization is egregious and full of Sam Altman sycophants who are causing real and serious harm to our society. Should we not personally attack the Nazis either? These people are literally pushing for a society where you're at a complete disadvantage. And they're betting on it. They're banking on it.
Yes, GPT-5.2 still has adaptive reasoning - we just didn't call it out by name this time. Like 5.1 and codex-max, it should do a better job at answering quickly on easy queries and taking its time on harder queries.
Why have "light" or "low" thinking then? I've mentioned this before in other places, but there should only be "none," "standard," "extended," and maybe "heavy."
Extended and heavy are about raising the floor (~25% and ~45%, or some other ratio, respectively), not determining the ceiling.
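(For what it's worth, the effort level is already just a request parameter. Here's a minimal sketch with the OpenAI Python SDK; the model name is taken from this thread rather than verified, and the prompt is made up.)

```python
from openai import OpenAI

client = OpenAI()

# Minimal sketch: pick a reasoning effort level per request.
# "gpt-5.2" is assumed from the thread, not verified; "low" is one of the
# effort levels the Responses API currently accepts.
resp = client.responses.create(
    model="gpt-5.2",
    input="Summarise this bug report in one sentence: ...",
    reasoning={"effort": "low"},
)
print(resp.output_text)
```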
Not sure what you mean; Altman does that fake-humility thing all the time.
It's a marketing trick: show honesty in areas that don't have much business impact so the public will trust you when you stretch the truth in areas that do (AGI, cough).
I'm confident that the GP is acting in good faith, though. Maybe I am falling for it. Who knows? It doesn't really matter, I just wanted to be nice to the guy. It takes some balls to post as an OpenAI employee here, and I wish we heard from them more often, as I am pretty sure plenty of them lurk around.
It's the only reasonable choice you can make. As an employee with stock options you do not want to get trashed on Hacker News, because that affects your income directly if you try to conduct a secondary share sale or plan to hold until the IPO.
Once the IPO is done and the lockup period has expired, a lot of employees plan to sell their shares. But until then, even if the product is behind competitors, there is no way you can admit it without putting your money at risk.
I know HN commenters like to see themselves as contrarians, as I do sometimes, but man… it seems like a serious stretch to assume malice here: that an employee of the world's top AI name would astroturf a random HN thread about a picture on a blog.
I’m fairly comfortable taking this OpenAI employee’s comment at face value.
Frankly, I don't think an HN thread will make a difference to his financial situation anyway…
Malicious? No, and this is far from astroturfing; he even speaks as "we". It's just a logical move to defend your company when people claim your product is buggy.
There is no other logical move; that is my point, contrary to the people above who say this takes a lot of courage. It's not about courage, it's just normal and logical (and yes, Hacker News matters a lot; this place is a very strong source of signal for investors).
They are definitely ahead in multimodality, and I'd argue they have been for a long time. Their image understanding was already great when their core LLM was still terrible.
Promotional content for LLMs is really poor. I was looking at Claude Code, and the example on their homepage implements a feature while ignoring a warning about a security issue, commits locally, doesn't open a PR, and then tries to close the GitHub issue. Whatever code it wrote, they clearly didn't use it, as the issue from the prompt is still open. Bizarre examples.
General-purpose LLMs aren't very good at generating bounding boxes, so with that context, this is actually decent performance for certain use cases.
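If you want to poke at this yourself, a minimal sketch of asking a general-purpose vision model for boxes looks something like the following. The model name, prompt, and JSON shape are my own assumptions, not anything from the blog post:

```python
import base64
import json

from openai import OpenAI

client = OpenAI()

# Assumed inputs: a local photo and an illustrative model name.
with open("motherboard.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="gpt-4o",  # illustrative; swap in whatever vision model you're testing
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": (
                "Return JSON only: a list of {label, x0, y0, x1, y1} bounding boxes "
                "(pixel coordinates) for the visible motherboard components."
            )},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)

# NB: models often wrap the JSON in a ```json fence or drift off-format;
# real code should strip/validate before parsing. That flakiness is part of
# why these models have a reputation for being weak at bounding boxes.
boxes = json.loads(resp.choices[0].message.content)
for b in boxes:
    print(b["label"], b["x0"], b["y0"], b["x1"], b["y1"])
```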
To be fair to OP, I just added this to our blog after their comment, in response to the correct criticisms that our text didn't make it clear how bad GPT-5.2's labels are.
LLMs have always been very subhuman at vision, and GPT-5.2 continues in this tradition, but it's still a big step up over GPT-5.1.
Something that is 97% accurate is wrong 3% of the time, so pointing out that it has gotten something wrong does not contradict 97% accuracy in the slightest.
It's not okay if claims are totally made up 1 in 30 times.
Of course people aren't always correct either, but we're able to operate on levels of confidence. We're also able to weight others' statements as more or less likely to be correct based on what we know about them.
No, it isn't. It isn't intelligent; it's a statistical engine. Telling it to be more or less confident doesn't make it apply confidence appropriately. It's all a facade.
That shouldn't be what causes these problems; if we can see it's wrong despite the low resolution, the AI isn't going to fully replace humans for all tasks involving this kind of thing.
That said, even with this kind of error rate an AI can speed *some* things up, because having a human whose sole job is to ask "is this AI correct?" is easier and cheaper than having one human to "do all these things by hand" followed by someone else whose sole job is to check "was this human's output correct?". A human who has been on a production line for 4 hours and is about ready for a break also makes a certain number of mistakes.
But at the same time, why use a really expensive general-purpose AI like this instead of a dedicated image model for your domain? A special-purpose model is something you can train on a decent laptop, and once trained it will run on a phone at perhaps 10 fps, give or take, depending on the performance threshold and how general you need it to be.
If you're in a factory and you're making a lot of some small widget or other (so, not a whole motherboard), having answers faster than the ping time to the LLM may be important all by itself.
And at this point, you can just ask the LLM to write the training setup for the image-to-bounding-box AI, and then you "just" need to feed in the example images.
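To be fair, the "training setup" the LLM would write really is about this much code. A minimal sketch using torchvision fine-tuning, where the class count and the dataloader of labelled widget images are placeholders you'd have to supply yourself:

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Placeholders: NUM_CLASSES and widget_loader are yours to define.
NUM_CLASSES = 3       # background + your widget classes
widget_loader = ...   # DataLoader yielding (list of image tensors, list of target dicts)

# Start from a pretrained detector and swap in a head for your own classes.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)

optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)
model.train()
for images, targets in widget_loader:
    loss_dict = model(images, targets)  # in train mode this returns a dict of losses
    loss = sum(loss_dict.values())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Feeding in the example images (and drawing the boxes) is still the expensive part, which is where the "just" does a lot of work.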
The official announcement is very sparse on details. If the developer doesn't know how his digital signature (and update infrastructure?) was compromised, how does switching to a new signature help? It could get compromised in the exact same way.
The article linked here adds some more details, but also, the official statement doesn't use the word "compromised". If it did, it would be a statement with a different meaning than the one that was released for us to read.
Unfortunately I must decline to answer for legal reasons, but I am aware that some people report a short-term “cure” that falls off after a longer period.
I am currently on the lowest commercially available dose of a time-release methylphenidate, with a dosage pattern that mitigates this long-term falloff in most people.
What has been most meaningful to me is the sense of hope both the diagnosis itself and the medication bring.
Sometimes one is trying, and is working hard enough, but is climbing higher mountains than other people while wondering why none of the online mountaineering advice makes sense.
You're absolutely right — the risk from endocrine disruption is much more dangerous and is being ignored by the majority of the population. What an insightful take!
Would you like me to expand on the reasons endocrine-disruptions are the bigger risk? Or would you like me to explore other ways in which microplastics might be dangerous to your health?
Rrrrgh. LLMs are often super useful as research assistants and for exploring gaps in knowledge, but the syrupy encouragement of some SOTA models has really started to set me on edge.