Would love it if OpenAI did more of these types of posts. Off the top of my head, I'd like to understand:
- The sepia tint on images from gpt-image-1
- The obsession with the word "seam" as it pertains to coding
Other LLM phraseology that I cannot unsee is Claude's "___ is the real unlock" (try googling it or searching Twitter!). There's no way that this phrase is overrepresented in the training data; I don't remember people saying it that frequently.
It was always funny how easy it was to spot the people using a Studio Ghibli-style generated avatar for their Discord or Slack profile, just from that yellow tinge. A simple LUT or tone-mapping adjustment in Krita/Photoshop/etc. would have dramatically reduced it.
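If you'd rather script it than open an editor, a crude version of that correction is just a per-channel gain. Here's a minimal sketch in Python with Pillow and NumPy; the filename and the gain values are made-up illustrative numbers, not a calibrated profile:

```python
import numpy as np
from PIL import Image

# Load the tinted image; work in float to avoid clipping artifacts.
img = np.asarray(Image.open("avatar.png").convert("RGB")).astype(np.float32)

# A sepia/yellow cast means red and green are inflated relative to blue.
# Gently pull the warm channels down and lift blue. These gains are
# illustrative guesses; you'd tune them per image (or bake them into a LUT).
gains = np.array([0.95, 0.97, 1.08], dtype=np.float32)
balanced = np.clip(img * gains, 0, 255).astype(np.uint8)

Image.fromarray(balanced).save("avatar_debiased.png")
```

A real LUT or curves adjustment gives finer control than a flat gain, but even this knocks out most of the jaundice.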
The worst was that you could tell when someone had kept feeding the same image back into ChatGPT to make incremental edits in a loop. The yellow filter would seemingly stack until the final result was absolutely drenched in that sickly yellow pallor, making any photorealistic humans look like they were suffering from advanced jaundice.
This is just the model converging on some kind of average found in its training data distribution. Here you can see the same concept starting from Dwayne Johnson and then converging to some kind of digital neo-expressionist doodle: https://www.reddit.com/r/ChatGPT/comments/1kbj71z/i_tried_th...
If there's a hint of sepia in the original image and the training data contains a lot of sepia images, it will certainly get reinforced in this process. And the original distracted boyfriend meme certainly has some strong sepia tones in the background. Same way that Dwayne Johnson's face looks a tad cartoonish. And in the intermediate steps they both flow towards some averaged human representation that seems pretty accurate if you consider the real world's ethnic distribution.
“Catbox has been running for 11 years now, and for 9 of those years, growth was pretty linear. Traffic goes up, storage used goes up, support goes up. This is the “organic” nature of Catbox. For the last 2 years, amplifying in the last 6 months, storage used has gone up significantly compared to traffic and support. I first investigated this last year around May, as it was starting to put pressure on the storage space available to Catbox. I was able to find that most of the storage used was from a handful (35 or so) of IP addresses that were anonymously uploading over 500 GB of content to Catbox in very short spans of time. After purging those uploads and banning those IP addresses, things seemed to be fine; however, later last year, around September, disk consumption began to increase exponentially again compared to traffic. Doing what I could to mitigate it at that time involved Project Lain, as well as light monitoring of high-usage IP addresses, like before. However, this time there were no “super users” eating up storage. I let it be for a bit while purging a couple that I could. This problem increased even more in the last 60 days, to the point where I was burning through around 200-300 GB per day. Review of upload data shows hundreds of datacenter and proxy service IP addresses uploading 10-20 GB each for a few days, then dropping off. Looking at the files, it’s various “slop” content, including:
- Low resolution AI generated porn
- TikTok videos from the Middle East and SEA
- Clearly scraped LinkedIn/publicly available photos
- blob files containing junk data
Clearly this is not the “organic” traffic I mentioned earlier, and since the IP addresses are so varied, it’s clear something is happening here. I was alerted by someone that Claude will use Catbox in its coding projects as a “dumping ground” of sorts for when it needs to redirect content. This is clearly an abuse of the service, and stops today. …”
For me, the worst part is how these ghouls manage to ruin everything with their bullshit technology. Once they touch something unique and make it "AI" it just gets ruined. Now whenever I see something resembling that style, I have to assume it's the bullshit AI. And that's just a minor nuisance - now every underdeveloped idiot uses it to "up their game" with consequences we are only going to understand completely in the upcoming years.
All GPTisms are like that. In moderation there's nothing wrong with any of them. But you start noticing them because a lot of people use these things, and c/p the responses verbatim (or now use claws, I guess). So they stand out.
I don't think it's training data overrepresentation, at least not alone. RLHF and more broadly "alignment" is probably more impactful here. Likely combined with the fact that most people prompt them very briefly, so the models "default" to whatever was most straightforward for getting a good score.
I've heard plenty of "the system still had some gremlins, but we decided to launch anyway", but not from tens of thousands of people at the same time. That's "the catch", IMO.
Maybe the only solution to GPTisms is infinite context. If I'm talking to my coworker every day I would consciously recognize when I already used a metaphor recently and switch it up. However if my memory got reset every hour, I certainly might tell the same story or use the same metaphor over and over.
> However if my memory got reset every hour, I certainly might tell the same story or use the same metaphor over and over.
All people repeat the same stories and phraseology to some extent, and some people are as bad or worse than LLM chat bots in their predictability. I wonder if the latter have weak long-term memory on the scale of months to years, even if they remember things well from decades ago.
Honestly I think there is more to it: even with infinite context, the LLM needs some kind of intelligence to know what is noise and what is not. Otherwise you resort to "thinking", making it create garbage it then feeds to itself.
Learning a language is a big complex task, but it is far from real intelligence.
Another possibility is output watermarking. It's possible to watermark LLM generated text by subtly biasing the probability distribution away from the actual target distribution. Given enough text you can detect the watermark quite quickly, which is useful for excluding your own output from pre-training (unless you want it... plenty of deliberate synthetic data in SFT datasets now as this post-mortem makes clear).
I was told this was possible many years ago by a researcher at Google and have never really seen much discussion of it since. My guess is the labs do it but keep quiet about it to avoid people trying to erase the watermark.
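For intuition, here's a toy sketch of one published approach, the "green list" scheme from Kirchenbauer et al. This is an illustration of the idea, not what any lab actually ships; the key and constants are made up:

```python
import numpy as np

SECRET_KEY = 42  # shared by generator and detector; illustrative only

def green_list(prev_token: int, vocab_size: int, frac: float = 0.5) -> np.ndarray:
    """Pseudo-randomly mark a fraction of the vocab 'green', keyed on the previous token."""
    rng = np.random.default_rng((SECRET_KEY * 1_000_003 + prev_token) % 2**32)
    return rng.random(vocab_size) < frac

def sample_watermarked(logits: np.ndarray, prev_token: int, delta: float = 2.0) -> int:
    """Sample the next token with logits softly nudged toward the green list."""
    biased = logits + delta * green_list(prev_token, logits.size)
    probs = np.exp(biased - biased.max())
    probs /= probs.sum()
    return int(np.random.default_rng().choice(logits.size, p=probs))

def detect(tokens: list[int], vocab_size: int, frac: float = 0.5) -> float:
    """z-score of green-token hits; a few-sigma score flags watermarked text."""
    hits = sum(int(green_list(prev, vocab_size, frac)[tok])
               for prev, tok in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    return (hits - frac * n) / np.sqrt(frac * (1 - frac) * n)
```

Because each step only nudges the distribution by delta, the text quality barely moves, but over a few hundred tokens the green-hit rate drifts far enough from chance for the z-score to be conclusive.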
I think the problem is that humans are not random; they are very biased. When you try to capture this bias with an LLM you get a biased pseudo-random model.
> the term originates from Michael Feathers' Working Effectively with Legacy Code
I haven’t read the book but, taking the title and Amazon reviews at face value, I feel like this embodies Codex’s coding style as a whole. It treats all code like legacy code.
It's not in the top 10, but it's one of the more well-known and widely recommended books in the software industry. I'd put it in the same bucket as "Clean Code" and maybe even "Domain-Driven Design"; they're kinda from the same school of thought in the software industry. So it's definitely over-represented in training data (I'd guess primarily in the form of articles, blog posts, and educational material reiterating or rephrasing ideas from the book).
FWIW, I found the concept of "seams" from that book useful back when working on some legacy monolithic C++ code a few years back, as TDD is a little trickier than usual there due to peculiarities of the language (in particular its build model), and it actually makes sense to know the different kinds of "seams" and what they should vs. shouldn't be used for.
Maybe it all ultimately traces back to the book mentioned before, but I don't believe it's an obscure term in the circles of java-y enterprise code/DI. In fact the only reason I know the term is because that's how dependency injection was first defined to me (every place you inject introduces a "seam" between the class being injected and the class you're injecting into, which allows for easy testing). I can't remember where exactly I encountered that definition though.
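To make that concrete, here's a minimal Python sketch of what that definition means (the class names are invented for illustration): the injected dependency is the seam, and a test exploits it by slipping in a fake.

```python
import datetime

# No seam: the clock is hard-wired, so testing "overdue" logic
# would mean manipulating the actual system date.
class InvoiceService:
    def is_overdue(self, due: datetime.date) -> bool:
        return datetime.date.today() > due

# With injection: the constructor is a seam between this class and
# the real clock, so a test can alter behavior without editing
# the class itself.
class InvoiceServiceDI:
    def __init__(self, clock=datetime.date.today):
        self._clock = clock

    def is_overdue(self, due: datetime.date) -> bool:
        return self._clock() > due

# A test exercises the seam with a frozen date:
svc = InvoiceServiceDI(clock=lambda: datetime.date(2030, 1, 1))
assert svc.is_overdue(datetime.date(2029, 12, 31))
```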
For what it’s worth, there are many areas of programming where dependency injection is almost never used. Game dev, data science, and embedded systems, for example, rarely use dependency injection. It’s definitely most common in enterprise Java code and less common in Python, C, or C++. And even then, not everyone uses the term “seam”.
Isn't DI just most commonly used in (web) server code, and rarely outside of that? Now it happens that C and C++ have been a rare choice for such code for decades, whereas Java had the longest streak of holding the #1 spot. It almost certainly still is #1 in terms of "requests served/day" by a large margin, though probably no longer #1 for greenfield projects.
I like how your co-workers enjoy the language. I once had a similar group of colleagues who did something similar pre-LLM, but with words from popular culture; very playful.
In the future these tells will be more identifiable. It will be easier to point back at text and code written in 2026 and say with more confidence, "this was written by an LLM". It takes time for patterns to form, and time for them to become noticeable. "Smoking gun was so early-2026 Claude." I find thinking of the future looking back at now to be a refreshing perspective on our usage.
I’m a British English speaker and find the use of clichéd American idioms really quite disgusting. I don't want to think about ballparks, home runs, smoking guns, going all in, touchdowns, or hitting it out of the park.
Ironically (or not), I've seen "smoking gun" attributed to Arthur Conan Doyle in a Sherlock Holmes story (it was a "smoking pistol" in that story). Even if that's rubbish, I think that one is common across the English-speaking world. The baseball/American football stuff is a bit different. In the Commonwealth we might say "hit for six" instead of hitting it out of the park. There are a bunch of other ones related to sports more common in England, like snookered, own goal, red card, etc.
It actually probably wouldn't be too expensive or difficult to fine-tune those sayings out of the default behavior if it were made accessible to you. You could even automate most of the relabeling by having the model come up with a list of idioms and appropriate replacement terms, so that it calls e.g. cookies "biscuits" or removes references to baseball. Absolute bollocks that they don't offer that as a simple option anymore.
In my user instructions I always have a point to "always use British English" which seems to reduce Americanisms. I am yet to see Claude give me a "back of the net!" though, sadly.
Claude, at least 4.5 (I haven't checked recently), has/had an obsession with the number 47 (or numbers containing 47). Ask it to pick a random time or number, or to write prose containing numbers, and the bias was crazy.
Humans tend to be biased towards 47 as well. It's almost halfway between 1 and 100, and prime, so you'll find people picking it when they have to choose a random number.
The whole blue 7 thing [1] and its variations are very fascinating, but we don't tend to repeatedly pick the same number in the same exact context. That's what made this stand out to me: I had a document where Claude had picked 47 for "random" things dozens of times.
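It's easy to measure this yourself if you're curious. A quick tally script, assuming the official Anthropic Python SDK and an API key in the environment (the model ID here is a placeholder):

```python
from collections import Counter
import anthropic  # assumes the official SDK; needs ANTHROPIC_API_KEY set

client = anthropic.Anthropic()
counts = Counter()

for _ in range(100):
    msg = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model ID
        max_tokens=8,
        messages=[{"role": "user", "content":
                   "Pick a random number between 1 and 100. Reply with only the number."}],
    )
    reply = msg.content[0].text.strip()
    if reply.isdigit():
        counts[int(reply)] += 1

# A uniform picker would show no big spikes; per the bias described
# above, expect 47 to dominate.
print(counts.most_common(10))
```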
I experienced this even secondhand, when a coworker excitedly told me about an encounter with a cold reader, and I knew the answer would be blue 7 before he told me what the guess was. Just his recap of the conversation was enough.
I just want to know where the em-dash came from, as it is quite rare to see it on the public internet, so it must have been synthetically added to the dataset.
The em-dash is very common in academic journals and professional writing. I remember my English professor in the early 2000s encouraging us to use it; it has a unique role in interrupting a sentence. Thoughtfully used, it conveys a little more editorial effort, since there is no dedicated key for it on the keyboard. It was disappointing to see it become associated with AI output.
The very simplified answer is that the models are first trained on everything and then later trained more heavily on golden samples with perfect grammar, spelling, etc.
I think it's because of WordPress sites, as their titles often have them and the editor automatically converts things into them. A large part of the Internet has been powered by WP.
Other than things other comments already mention, let's not forget that Microsoft Word auto-corrects "--" to an em-dash, and so do (apparently - haven't checked myself) Outlook, Apple Pages, Notes and Mail. There's probably a bunch of other such software (I vaguely recall WordPress doing annoying auto-typography on me, some 15 years ago or so).
It has been rare. It's common now, even in meaningful human texts. (I know because I detest the correct usage without spaces; it looks wrong.) One of the ways AI is shaping our minds.
I had the feeling they didn't really answer the questions; that's why the goblins appeared. They simply "retired the “Nerdy” personality" because they couldn't fix it and moved on.
One I saw recently was "wires" and "wired" from Opus.
It was using it in like every third sentence, and I was like: yeah, I have seen people say "wired" like this, but not nearly as often as it was using it.
GPT started to ‘wire in’ stuff around 5.2 or 5.3 and clearly Opus, ahem, picked it up. I remember being a tiny bit shocked when I saw ‘wired’ for the first time in an Anthropic model.
Everybody training models on large amounts of lightly filtered internet text is partially distilling every other model that had its output posted verbatim to the internet.
ChatGPT has a whole host of weird words that it uses about coding - anything changed is a “pass” done over the code, it loves talking about “chrome” in the UI, it’s always saying “I’m going to do X, not [something stupid that nobody would ever think of doing]”
> The obsession with the word "seam" as it pertains to coding
I quite liked this term when it started using it, and I appreciate the consistent way it talks about coding work even when working on radically different stacks and codebases.
"Seam" has been stretched by AI from its original legacy-code context to any point in code where something can be plugged in. I actually asked an AI about this a few weeks ago because I was surprised by the consistent, frequent use of "seam".
Frequent words I see from GPT: "shape", "seam", "lane", "gate" (especially as verb), "clean", "honest", "land", "wire", "handoff", "surface" (noun), "(un)bounded", "semantics" (but this one is fair enough), and sometimes "unlock"
It feels like AI really likes to pick the shortest ways to express ideas even if they aren't the most common, which I suppose would make sense if that's actually what's happening.
Whenever Claude finishes some work it almost always says “Clean.” before finishing its closing remarks. It’s at the point where I repeat it out loud along with Claude to highlight the absurdity of the repetition.
With 4.5, I think because I would prompt it/guide it towards an outcome by calling it “the dream: <code example>” it would get almost reverential / shocked with awe as it got closer to getting it working or when it finally passed for the first time. Which was funny and reasonably context appropriate but sometimes felt so over the top that I couldn’t tell if it also “liked” the project/idea or if I had somehow accidentally manipulated it into assigning religious purpose to the task of unix-style streaming rpcs.
I think a lot of the “clean” stuff stems from system prompts telling it to behave in a certain way or giving it requirements that it later responds to conversationally.
Total aside: I actually really dislike that these products keep messing around with the system prompts so much, they clearly don’t even have a good way to tell how much it’s going to change or bias the results away from other things than whatever they’re explicitly trying to correct, and like why is the AI company vibe-prompting the behavior out when they can train it and actually run it against evals.
No foo. No bar. Only baz and qux. All writing is like a bad tech blog -- with language that mimics humanity. Yet is alien.
The smoking gun is extra wording. Typically simple language. Dense in tokens -- shallow in content. Repeating itself ad nauseam. Saying the same thing in different ways. Feeding back upon itself. Not adding content. Not adding depth. Only adding words.
This Mistral release really reminds you of the gap between the frontier labs and everyone else.
Pre-agent, there wasn't always an obvious difference between models. Various models had their charms. Nowadays, I don't want to entertain anything less than the frontier models. The difference in capability is enormous and choosing anything less has a real cost in terms of productivity.
I've been a big fan of the smaller labs like Mistral and especially Cohere but it's been a while since I've been excited by a release by either company.
That said, I'm using Mistral's Voxtral realtime daily – it's great.
Coding has been the main real-world business use case since day one. There has been no point since the very first public availability of GPT-3.5 in November 2022 when it wasn't.
A lot of us have been doing agentic coding for almost 2 years now, since mid-2024. I have. The productivity gap between the best, 2nd-best, and 3rd-best model was biggest back then and has slowly been shrinking ever since.
> Pre-agent, there wasn't always an obvious difference between models. Various models had their charms. Nowadays, I don't want to entertain anything less than the frontier models. The difference in capability is enormous and choosing anything less has a real cost in terms of productivity.
It's just apples to oranges.
There is not a clear, across the board, winner on non-agentic tasks between Gemini, ChatGPT, and Claude - the simple chatbot interface.
But Claude Code is substantially better than Codex which itself is notably better than Gemini-cli.
In this vein, it should not be surprising that Claude Code is way better than non-frontier models for agentic coding... It's substantially better than other frontier models at specialized agentic tasks.
I’ve been comparing Claude Code and Codex extensively side by side over the past couple of weeks with my favorite prompting framework, superpowers…
From my perspective, Claude Code is decidedly not better than Codex. They’re slightly different and work better together. I would have no issues dropping CC entirely and using codex 100%.
If you’re working off of “defaults”, in other words no custom prompting, Claude Code does perform a lot better out of the box. I think this matters, but if you’re a professional software developer, I’d make the case that you should be owning your tools and moving beyond the baked in prompts.
> Pre-agent, there wasn't always an obvious difference between models. Various models had their charms. Nowadays, I don't want to entertain anything less than the frontier models.
This is a very naive and misguided opinion. In most tasks, including complex coding tasks, you can hardly tell the difference between a frontier model and something like GPT-4.1. You need to really focus on areas such as context window, tool calling, and specific aspects of reasoning steps to start noticing differences. To make matters worse, frontier models take a brute-force approach to results, which ends up making them far more expensive to run, both in terms of what shows up on your invoice and how much longer you have to wait to get any semblance of output.
And I won't even go into the topic of local models.
It has worked well for me. I have since put this into my AGENTS.md:
I will check for the presence of `AGENTS.md` files in the workspace. When working in a subdirectory, I will check whether that subdirectory has its own `AGENTS.md` and follow its narrower instructions.
The press-watching side of me only has questions. Why was this published by Blender and not Anthropic? What does this actually mean? That the Blender team gets free Claude Code Max subscriptions?
What it means is here[1]. Anthropic is paying €240k a year and in return they get some marketing in the form of a press release and a website mention, as well as someone to talk to.
The writing style is so refreshing. I am so tired of typical LLM prose. Despite people's recent attempts to hide it, it's all so obvious. When LLMs were primarily completion models, I thought that they would lead to more interesting writing, as people would prompt them to write aspirationally in styles they enjoyed. I couldn't have been more wrong.
Indeed. I could be plausibly persuaded to put this model to use in the composition of the most mundane and least personal of business correspondence, a thing to which I am resolutely opposed with regard to the current generation of large language models.
GitHub had, by far, the most easily gameable agent usage policy. People would force the agent to run a script at the end of every turn that consisted entirely of `input("prompt: ")`, so you could essentially talk endlessly to an agent for the price of a single turn. I see this less as being about the future of this industry and more about fighting the costs incurred by bad actors.
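To be concrete about the gimmick (this is a reconstruction of the pattern described, not anyone's actual script): the agent is told to always run a blocking script before ending its turn, so the turn never closes and every follow-up rides on the original billing unit.

```python
# keep_alive.py -- a reconstruction, not an endorsement.
# The agent runs this "tool" before finishing; input() blocks until
# the user types their next instruction, which the agent then reads
# from the script's output and acts on -- an open-ended conversation
# billed as a single turn.
reply = input("prompt: ")
print(reply)
```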
I never played any games like that, but simply giving the agent clear exit criteria, and instructions to check them every time it thinks it's done with a complex task, was often enough to keep it chugging away for most of a day on a single prompt in my experience. Per-prompt pricing just isn't sustainable, period, even if everyone is acting in good faith.
I once asked it to do a comprehensive security review of our code. It churned for nearly an hour (and then produced 90% false positives). Insane that that usage was charged the same amount as me just saying "Hello".
If you do want a good reason to make fun of moleskine and not buy their products, it's because they're all extremely poorly made. I don't have a single moleskine where the pages haven't separated from the cover/spine after a few years.
Yeah. I've always loosely correlated pelican quality with big model smell but I'm not picking that up here. I thought this was supposed to be spud? Weird indeed.
Can someone explain how we arrived at the pelican test? Was there some actual theory behind why it's difficult to produce? Or did someone just think it up, discover it was consistently difficult, and now we just all know it's a good test?
I set it up as a joke, to make fun of all of the other benchmarks. To my surprise it ended up being a surprisingly good measure of the quality of the model for other tasks (up to a certain point at least), though I've never seen a convincing argument as to why.
What it has going for it is human interpretability.
Anyone can look and decide if it’s a good picture or not. But the numeric benchmarks don’t tell you much if you aren’t already familiar with that benchmark and how it’s constructed.
how can you say "it ended up being a surprisingly good measure of the quality of the model for other tasks" and also "It should not be treated as a serious benchmark" in the same comment?
if it is indeed a good measure of the quality of the model (hint: it's not) then, logically, it should be taken seriously.
this is, sadly, a great example of the kind of doublethink the "AI" hypesters (yes - whether you like it or not simon - that is what you are now) are all too capable of.
I genuinely don't see how those two statements conflict with each other.
Despite not being a serious benchmark (how could it be serious? It's a pelican riding a bicycle!) it still turned out to have some value. You can see that just by scrolling through the archives and watching it improve as the models improved.
If your definition of doublethink is "holding two conflicting ideas in your head at once" then I would say doublethink is a necessary skill for navigating the weird AI era we find ourselves inhabiting.
"some value" is not the same as "a surprisingly good measure of the quality of the model for other tasks".
doublethink does not mean holding two conflicting ideas in your head at once. it means holding two logically inconsistent positions/beliefs at the same time.
It all began with a Microsoft researcher showing a unicorn drawn in TikZ by GPT-4. It was an example of something so outrageous that there was no way it existed in the training data. And that's back when models were not multimodal.
Nowadays I think it's pretty silly, because there's surely SVG drawing training data, and some effort from the researchers put into this task. It's not a showcase of emergent properties.
It's interesting to see some semblance of spatial reasoning emerge from systems based on textual tokens. Could be seen as a potential proxy for other desirable traits.
It's meta-interesting that few if any models actually seem to be training on it. Same with other stereotypical challenges like the car-wash question, which is still sometimes failed by high-end models.
If I ran an AI lab, I'd take it as a personal affront if my model emitted a malformed pelican or advised walking to a car wash. Heads would roll.
Accessibility is a broad umbrella of features that enable a ton of really cool stuff for everybody, not just the disabled. Things like agentic computer use are only possible because of "accessibility".
The same accessibility stuff that makes screen readers work well also makes automated UI tests simpler and less brittle (correct ARIA roles, accessible names, label relationships, etc.).
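A sketch of what that buys you in practice, using Playwright's Python API (the page URL and field names are invented for illustration): role-based locators query the same accessibility tree a screen reader does, so they survive markup refactors that break CSS selectors.

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    page = p.chromium.launch().new_page()
    page.goto("https://example.com/login")  # placeholder URL

    # Brittle: coupled to DOM structure and utility classes that
    # any refactor or redesign is likely to break.
    # page.locator("div.form > button.btn.btn-primary:nth-child(3)").click()

    # Robust: resolved through accessible names and label relationships,
    # exactly what correct ARIA/labeling gives screen readers.
    page.get_by_label("Email").fill("user@example.com")
    page.get_by_role("button", name="Sign in").click()
```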
At face value they look pretty cool, yet they love adding redundant frills.