
I don't mean this as an insult, but I know a GPT 5.5 design when I see one!

At face value they look pretty cool, yet they love adding redundant frills.


Would love if OpenAI did more of these types of posts. Off the top of my head, I'd like to understand:

- The sepia tint on images from gpt-image-1

- The obsession with the word "seam" as it pertains to coding

Other LLM phraseology that I cannot unsee is Claude's "___ is the real unlock" (try googling it or searching twitter!). There's no way that this phrase is overrepresented in the training data; I don't remember people saying it that frequently.


It was always funny how easy it was to spot the people using a Studio Ghibli style generated avatar for their Discord or Slack profile, just from that yellow tinging. A simple LUT or tone-mapping adjustment in Krita/Photoshop/etc. would have dramatically reduced it.

The worst was that you could tell when someone had kept feeding the same image back into ChatGPT to make incremental edits in a loop. The yellow filter would seemingly stack until the final result was absolutely drenched in that sickly yellow pallor, making any photorealistic humans look like they were suffering from advanced jaundice.
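A gray-world white balance is one crude way to pull a uniform cast like that back out programmatically. This is a minimal NumPy sketch for illustration, not what Krita or Photoshop actually do internally:

```python
import numpy as np

def gray_world_balance(img: np.ndarray) -> np.ndarray:
    """Scale each RGB channel so its mean matches the overall mean,
    neutralizing a uniform color cast such as a yellow tint."""
    img = img.astype(np.float64)
    channel_means = img.reshape(-1, 3).mean(axis=0)
    gains = channel_means.mean() / channel_means
    return np.clip(img * gains, 0, 255).astype(np.uint8)

# Synthetic yellow-tinted gray: red and green boosted, blue suppressed.
tinted = np.full((4, 4, 3), (200, 190, 120), dtype=np.uint8)
balanced = gray_world_balance(tinted)
```

After balancing, the three channel means end up roughly equal, i.e. the tint is gone. A real tool would use a LUT or tone curve rather than a single per-channel gain, but the idea is the same.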


For context, an example of what happens when you feed the same image back in repeatedly: https://www.instagram.com/reels/DJFG6EDhIHs/

This is just the model converging on some kind of average found in its training data distribution. Here you can see the same concept starting from Dwayne Johnson and then converging to some kind of digital neo-expressionist doodle: https://www.reddit.com/r/ChatGPT/comments/1kbj71z/i_tried_th...

If there's a hint of sepia in the original image and the training data contains a lot of sepia images, it will certainly get reinforced in this process. And the original distracted boyfriend meme certainly has some strong sepia tones in the background. Same way that Dwayne Johnson's face looks a tad cartoonish. And in the intermediate steps they both flow towards some averaged human representation that seems pretty accurate if you consider the real world's ethnic distribution.
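A toy model of that convergence: treat each generate-and-refeed round trip as an operator that pulls every pixel some fraction of the way toward a fixed prior (the training-data "average"). Iterating any such operator is a contraction, so it converges to the prior regardless of the starting image. The prior value and blend strength below are made up for illustration:

```python
import numpy as np

def round_trip(img: np.ndarray, strength: float = 0.3) -> np.ndarray:
    """One generate-and-refeed cycle: blend the image toward a fixed
    prior, standing in for the model's training-data average."""
    prior = np.full_like(img, 0.6)  # made-up "average" tone
    return (1 - strength) * img + strength * prior

rng = np.random.default_rng(0)
img = rng.random((8, 8))       # random starting "image" in [0, 1)
for _ in range(30):
    img = round_trip(img)      # deviation from the prior shrinks by 0.7x per pass
```

After 30 passes the deviation has shrunk by a factor of about 0.7^30 ≈ 2e-5, so every starting image ends up at the same fixed point, which is roughly what the Dwayne Johnson thread shows.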


Haha fantastic. I'd love to see a comparison reel of that same image-loop for the entire image gen series (gpt-image-1, gpt-image-1.5, gpt-image-2).

Fixed points are a window to the soul of a LLM

- Lucretius in "De rerum natura", probably



0 bytes?

Hmm, sorry about that

Expires in 2d: https://streamable.com/dkyvu8

Edit: bad actors spamming Catbox

https://blog.catbox.moe/post/813932072453455872/happy-11th-b...

@Anthropic get your Claude under control:

  “Catbox has been running for 11 years now, and for 9 of those years, growth was pretty linear. Traffic goes up, storage used goes up, support goes up. This is the “organic” nature of Catbox. For the last 2 years, amplifying in the last 6 months, both storage used has gone up significantly compared to traffic and support. I first investigated this last year around May, as it was starting to put pressure on the storage space available to Catbox. I was able to find that most of the storage used was from a handful (35 or so) IP addresses that were uploading over 500 GB of content to Catbox in very short spans of time anonymously. After purging those uploads and banning those IP addresses, things seemed to be fine, however later last year, around September, disk consumption began to increase exponentially again compared to traffic. Doing what I could to mitigate it at that time involved Project Lain, as well as light monitoring of high usage IP addresses, like before. However this time there was no “super users” that were eating up storage. I let it be for a bit while purging a couple that I could. This problem increased even more in the last 60 days, to the point where I was burning through around 200-300 GB per day. Review of upload data shows hundreds of datacenter and proxy service IP addresses uploading 10-20 GB each for a few days, then dropping off. Looking at the files, it’s various “slop” content, including:

  - Low resolution AI generated porn
  - Tiktok Videos from the Middle East and SEA
  - Clearly scraped LinkedIn/publicly available photos
  - blob files containing junk data

  Clearly this is not the “organic” traffic I mentioned earlier, and since the IP addresses are so varied, it’s clear something is happening here. I was alerted by someone that Claude will use Catbox in its coding projects as a “dumping ground” of sorts for when it needs to redirect content. This is clearly an abuse of the service, and stops today. …”

catbox has been doing that for videos recently, don't know why. try https://www.vxinstagram.com/reels/DJFG6EDhIHs/

Thank you!

Not necessarily related but this would’ve been a big distraction:

https://blog.catbox.moe/post/813932072453455872/happy-11th-b...


I like how the AI seems forced to change their ethnicity to keep up with the color changes. Absolutely wild.

Enough internet for today

That is so creepy in a sci fi other worlds type way.

For me, the worst part is how these ghouls manage to ruin everything with their bullshit technology. Once they touch something unique and make it "AI" it just gets ruined. Now whenever I see something resembling that style, I have to assume it's the bullshit AI. And that's just a minor nuisance - now every underdeveloped idiot uses it to "up their game" with consequences we are only going to understand completely in the upcoming years.

It's called the piss filter.

All GPTisms are like that. In moderation there's nothing wrong with any of them, but you start noticing them because a lot of people use these things and copy/paste the responses verbatim (or now use claws, I guess). So they stand out.

I don't think it's training data overrepresentation, at least not alone. RLHF and more broadly "alignment" are probably more impactful here, likely combined with the fact that most people prompt very briefly, so the models "default" to whatever was most straightforward for getting a good score.

I've heard plenty of "the system still had some gremlins, but we decided to launch anyway", but not from tens of thousands of people at the same time. That's "the catch", IMO.


Maybe the only solution to GPTisms is infinite context. If I'm talking to my coworker every day I would consciously recognize when I already used a metaphor recently and switch it up. However if my memory got reset every hour, I certainly might tell the same story or use the same metaphor over and over.

> However if my memory got reset every hour, I certainly might tell the same story or use the same metaphor over and over.

All people repeat the same stories and phraseology to some extent, and some people are as bad or worse than LLM chat bots in their predictability. I wonder if the latter have weak long-term memory on the scale of months to years, even if they remember things well from decades ago.


Honestly I think there is more to it: even with infinite context, the LLM needs some kind of intelligence to know what is noise and what is not. Otherwise you resort to "thinking", making it create garbage that it then feeds back to itself.

Learning a language is a big complex task, but it is far from real intelligence.


Another possibility is output watermarking. It's possible to watermark LLM generated text by subtly biasing the probability distribution away from the actual target distribution. Given enough text you can detect the watermark quite quickly, which is useful for excluding your own output from pre-training (unless you want it... plenty of deliberate synthetic data in SFT datasets now as this post-mortem makes clear).

I was told this was possible many years ago by a researcher at Google and have never really seen much discussion of it since. My guess is the labs do it but keep quiet about it to avoid people trying to erase the watermark.
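There has been at least one published scheme along these lines (the red/green-list watermark): seed a PRNG with the previous token, pseudo-randomly split the vocabulary into a "green" half, bias sampling toward it, and detect by counting the green fraction over a span of text. A toy sketch with a fake vocabulary, not any lab's real implementation:

```python
import hashlib
import random

VOCAB = list(range(1000))  # toy vocabulary of token ids
GREEN_FRACTION = 0.5

def green_set(prev_token: int) -> set[int]:
    """Pseudo-randomly partition the vocab, seeded by the previous token."""
    seed = int(hashlib.sha256(str(prev_token).encode()).hexdigest(), 16)
    rng = random.Random(seed)
    return set(rng.sample(VOCAB, int(len(VOCAB) * GREEN_FRACTION)))

def detect(tokens: list[int]) -> float:
    """Fraction of tokens drawn from their context's green set.
    Unwatermarked text hovers near GREEN_FRACTION; watermarked text sits well above."""
    hits = sum(tokens[i] in green_set(tokens[i - 1]) for i in range(1, len(tokens)))
    return hits / (len(tokens) - 1)

# Watermarked generation: always pick from the green set (extreme bias
# for demonstration; a real scheme only nudges the logits).
rng = random.Random(0)
tokens = [rng.choice(VOCAB)]
for _ in range(200):
    tokens.append(rng.choice(sorted(green_set(tokens[-1]))))
```

With only a soft logit bias instead of this hard restriction, the watermark survives while barely changing the output distribution, which is why it is hard to notice and hard to strip.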


I think the problem is that humans are not random; they are very biased. When you try to capture this bias with an LLM you get a biased pseudo-random model.

>with the word "seam" as it pertains to coding

I thought this was an established term when it comes to working with codebases comprised of multiple interacting parts.

https://softwareengineering.stackexchange.com/questions/1325...


thanks for this.

> the term originates from Michael Feathers Working Effectively with Legacy Code

I haven’t read the book but, taking the title and Amazon reviews at face value, I feel like this embodies Codex’s coding style as a whole. It treats all code like legacy code.


It's been a long time since I read it, but it was one of the better books I've read. It changed my approach to how to think about old code-bases.

It's not in the top 10, but it's one of the more well-known and widely recommended books in the software industry. I'd put it in the same bucket as "Clean Code" and maybe even "Domain-Driven Design"; they're roughly from the same school of thought. So it's definitely over-represented in the training data (I'd guess primarily in the form of articles, blog posts, and educational material reiterating or rephrasing ideas from the book).

FWIW, I found the concept of "seams" from that book useful back when working on some legacy monolithic C++ code a few years back. TDD is a little trickier than usual there due to peculiarities of the language (and in particular its build model), and it actually makes sense to know the different kinds of "seams" and what they should and shouldn't be used for.


No, it’s not an established term outside the mentioned books, beyond the generic meaning of the word.

I have frequently encountered the term in the context of unit testing and dependency injection.

Other references (and all predate chatgpt):

>Seams are places in your code where you can plug in different functionality

>Art of Unit Testing, 2nd edition page 54

(https://blog.sasworkshops.com/unit-testing-and-seams/)

>With the help of a technique called creating a seam, or subclass and override we can make almost every piece of code testable.

https://www.hodler.co/2015/12/07/testing-java-legacy-code-wi...

> seam; a point in the code where I can write tests or make a change to enable testing

https://danlimerick.wordpress.com/2012/06/11/breaking-hidden...

Maybe it all ultimately traces back to the book mentioned before, but I don't believe it's an obscure term in the circles of java-y enterprise code/DI. In fact the only reason I know the term is because that's how dependency injection was first defined to me (every place you inject introduces a "seam" between the class being injected and the class you're injecting into, which allows for easy testing). I can't remember where exactly I encountered that definition though.
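For illustration (all names here are invented), the "subclass and override" seam mentioned above looks like this in Python: the production class routes its dependency through an overridable method, and a test subclass plugs a fake in at exactly that point.

```python
import datetime

class InvoiceMailer:
    def send(self, address: str, body: str) -> None:
        # Imagine a real SMTP call here; impossible to run in a unit test.
        raise RuntimeError("no SMTP server available in tests")

class Billing:
    def mailer(self) -> InvoiceMailer:
        # The seam: a single point where a test can alter behavior
        # without touching the production logic in bill().
        return InvoiceMailer()

    def bill(self, address: str) -> str:
        body = f"Invoice generated {datetime.date.today()}"
        self.mailer().send(address, body)
        return body

class FakeMailer(InvoiceMailer):
    def __init__(self) -> None:
        self.sent: list[tuple[str, str]] = []

    def send(self, address: str, body: str) -> None:
        self.sent.append((address, body))

class TestableBilling(Billing):
    def __init__(self) -> None:
        self.fake = FakeMailer()

    def mailer(self) -> InvoiceMailer:
        return self.fake
```

Constructor injection achieves the same thing; every injected dependency introduces one of these seams, which is presumably why the term shows up so often in DI and unit-testing material.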


For what it’s worth, there are many areas of programming where dependency injection is almost never used. Game dev, data science, and embedded systems, for example, rarely use dependency injection. It’s definitely most common in enterprise Java code and less common in Python, C, or C++. And even then, not everyone uses the term “seam”.

Isn't DI just most commonly used in (web) server code, and rarely outside of that? Now it happens that C and C++ have been a rare choice for such code for decades, whereas Java had the longest streak of holding the #1 spot. It almost certainly still is #1 in terms of "requests served/day" by a large margin, probably no longer is #1 for greenfield projects.

I can't say it isn't, but I have been writing code since about 2004 and this is the first time I've become aware that this is a thing.

The one phrase that irks me as overly dramatic, and that both GPT and Claude use a lot, is "__ is the real smoking gun!"

I'm a non-native English speaker, so maybe it's a really common idiom to use when debugging?


It probably was found in a bunch of meaningful code commit messages

My colleagues were joking about smoking guns yesterday after noticing that Claude was obsessed with it.

I like how your co-workers enjoy the language. I once had a similar group of colleagues who did something similar pre-LLM, but with words from popular culture. Very playful.

In the future these tells will be more identifiable. It will be easier to point back at text and code written in 2026 and say with more confidence, "this was written by an LLM". It takes time for patterns to form and for them to become noticeable. "Smoking gun was so early-2026 Claude." I find thinking of the future looking back at now to be a refreshing perspective on our usage.


I’m a British English speaker and find the use of cliched American idioms really quite disgusting. Don’t want to think about about ballparks, home runs, smoking guns, going all in, touchdowns or hitting it out the park.

Ironically (or not) I've seen smoking gun attributed to Arthur Conan Doyle in a Sherlock Holmes story. (It was smoking pistol in that story). Even if that's rubbish, I think that one is common across the English speaking world. The baseball/American football stuff is a bit different. In the commonwealth we might say "Hit for six" instead of hitting it out of the park. There are a bunch of other ones related to sports more common in England like snookered, own-goal, red card, etc.

That observation about Sherlock Holmes certainly puts the smackdown on me and gets you to home plate.

It actually probably wouldn't be too expensive or difficult to fine-tune those sayings out of the default behavior if it were made accessible to you. You could even automate most of the relabeling by having the model come up with a list of idioms and appropriate replacement terms, so it calls e.g. cookies "biscuits" or removes references to baseball. Absolute bollocks that they don't offer that as a simple option anymore.

Should send over a geezer to give them a slap.

In my user instructions I always have a point to "always use British English" which seems to reduce Americanisms. I am yet to see Claude give me a "back of the net!" though, sadly.

Crikey, you are correct!

> I'm a non-native English speaker, so maybe it's a really common idiom to use when debugging?

No. But it is something goblins say a lot.


Especially sleuth goblins...

Claude, at least 4.5, not checked recently, has/had an obsession with the number 47 (or numbers containing 47). Ask it to pick a random time or number, or write prose containing numbers, and the bias was crazy.

Also "something shifted" or "cracked".


Humans tend to be biased towards 47 as well. It’s almost halfway between 1 and 100 and prime so you’ll find people picking it when they have to choose a random number.

Then there’s the whole Pomona College thing https://en.wikipedia.org/wiki/47_(number)


The whole blue 7 thing [1] and variations is very fascinating, but we don't tend to repeatedly pick the same number in the same exact context, though. That's what made this stand out to me - I had a document where Claude had picked 47 for "random" things dozens of times.

[1] https://en.wikipedia.org/wiki/Blue%E2%80%93seven_phenomenon

I experienced this even second hand when a coworker excitedly told of an encounter with a cold reader, and I knew the answer would be blue 7 before he told me what his guess was. Just his recap of the conversation was enough.


I am biased towards 67

Funny, I didn't know there were 10-year-olds on Hacker News!

The thirteen-year-olds are biased towards 69.

Maybe Claude is just a fan of Alias.

I just asked GPT 5.5 Thinking to choose any random 2 digit number. The result was indeed 47. Interesting.

Gemini gave 42

I just want to know where the em-dash came from, as it's quite rare to see on the public internet, so it must have been synthetically added to the dataset.

Emdash is very common in academic journals and professional writing. I remember my English professor in the early 2000s encouraging us to use it, it has a unique role in interrupting a sentence. Thoughtfully used, it conveys a little more editorial effort, since there is no dedicated key on the keyboard. It was disappointing to see it become associated with AI output.

The very simplified answer is that the models are first trained on everything and then are later trained more heavily on golden samples with perfect grammar, spelling, etc..

I think it's because of Wordpress sites, as their titles often have them and the editor automatically turns things into them. A large part of the Internet has been powered by WP.

Other than things other comments already mention, let's not forget that Microsoft Word auto-corrects "--" to em-dash, and so does (apparently - haven't checked myself) Outlook, Apple Pages, Notes and Mail. There's probably bunch of other such software (I vaguely recall Wordpress doing annoying auto-typography on me, some 15 years ago or so).

Because on the public internet people don't have arts degrees, which is where em-dash users learn to wield it correctly.

I learned about em-dashes by reading Knuth about 40 years ago.

Although em-dashes are not common on the internet, they are prevalent in books.

Logo_Daedalus tended to use it a lot

https://xcancel.com/Logo_Daedalus


`---` in TeX?

It has been rare. It's common now, even in meaningful human texts. (I know because I detest the correct usage without spaces; it looks wrong.) One of the ways AI is shaping our minds.

One I noticed with gemini, especially 3 flash: "this is the classic _____".

I had the feeling they didn't really answer the questions, that is why the goblins appeared. They simply "retired the “Nerdy” personality" because they couldn't fix it and went on.

"is the real" is such a strong Claude tell, whenever I encounter it, it makes me question what i'm reading.

Another I've noticed more recently is a slight obsession with referring to "Framing".


I miss being told “You’re absolutely right!” :’(

You're absolutely right. I was wrong in the first place

One I saw recently was "wires" and "wired" from opus.

It was using them like every third sentence, and I was like, yeah, I have seen people say "wired" like this, but not nearly as often as it was using it.


GPT started to ‘wire in’ stuff around 5.2 or 5.3 and clearly Opus, ahem, picked it up. I remember being a tiny bit shocked when I saw ‘wired’ for the first time in an Anthropic model.

Anthropic distills GPT?

Everybody training models on large amounts of lightly filtered internet text is partially distilling every other model that had its output posted verbatim to the internet.

And OpenAI probably distills Anthropic; who wouldn't?

It's all one big incestuous mess. In a couple of years we'll be talking about AI brainrot.


The number of things that Claude has told me are 'load-bearing' or 'belt-and-suspenders' is... very load-bearing

You are absolutely right to call that out!

for me, doing the heavy lifting is doing the heavy lifting

Fun fact: the word "suffer" comes from Latin sub + ferre, roughly "to bear under" (i.e., under load). This relation (suffer - load-bearing) is consistent across (unrelated) languages.

Also too many lands and hits.

ChatGPT has a whole host of weird words that it uses about coding - anything changed is a “pass” done over the code, it loves talking about “chrome” in the UI, it’s always saying “I’m going to do X, not [something stupid that nobody would ever think of doing]”

gpt also loves talking about handwaving, "I'm going to do X, not just a hand-wavy victory lap"

> The obsession with the word "seam" as it pertains to coding

I quite liked this term when it started using it. And I appreciate the consistent way it talks about coding work even when working on radically different stacks and codebases


"Seam" has been stretched by AI from its original legacy-code context to any point in code where something can be plugged in. I actually asked an AI about this a few weeks ago because I was surprised by the consistent, frequent use of "seam".

Frequent words I see from GPT: "shape", "seam", "lane", "gate" (especially as verb), "clean", "honest", "land", "wire", "handoff", "surface" (noun), "(un)bounded", "semantics" (but this one is fair enough), and sometimes "unlock"

It feels like AI really likes to pick the shortest ways to express ideas even if they aren't the most common, which I suppose would make sense if that's actually what's happening.


Seams, spirals, codexes, recursion, glyphs, resonance, the list goes on and on.

Ask any LLM for 10 random words and most of them will give you the same weird words every time.

If you lower the temperature setting, it really will be the same 10 words every single attempt. :p

They are text completion algorithms with little randomness.
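A toy illustration of why low temperature pins the output: dividing the logits by the temperature before softmax collapses the distribution onto the argmax as temperature approaches zero, so every draw returns the same token. Made-up logits, not a real model:

```python
import math
import random

def sample(logits: list[float], temperature: float, rng: random.Random) -> int:
    """Temperature-scaled sampling over a list of logits."""
    if temperature <= 1e-6:
        # Limit case: greedy decoding, always the highest logit.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    r = rng.random() * sum(weights)
    for i, w in enumerate(weights):
        r -= w
        if r <= 0:
            return i
    return len(logits) - 1

logits = [2.0, 1.0, 0.5, 0.1]
rng = random.Random(0)
cold = {sample(logits, 0.0, rng) for _ in range(100)}  # one token, every time
hot = {sample(logits, 5.0, rng) for _ in range(100)}   # spread across tokens
```

At temperature 0 the hundred draws all land on the same index; at high temperature the distribution flattens and the draws spread out. The chat products run somewhere in between, which is why the same "random" words keep surfacing.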

"shape" too, at least with gpt5.5, is coming up constantly.

I thought the “why it matters” headline was a funny reference to ChatGPT phraseology

Whenever Claude finishes some work it almost always says “Clean.” before finishing its closing remarks. It’s at the point where I repeat it out loud along with Claude to highlight the absurdity of the repetition.

With 4.5, I think because I would prompt it/guide it towards an outcome by calling it “the dream: <code example>” it would get almost reverential / shocked with awe as it got closer to getting it working or when it finally passed for the first time. Which was funny and reasonably context appropriate but sometimes felt so over the top that I couldn’t tell if it also “liked” the project/idea or if I had somehow accidentally manipulated it into assigning religious purpose to the task of unix-style streaming rpcs.

I think a lot of the “clean” stuff stems from system prompts telling it to behave in a certain way or giving it requirements that it later responds to conversationally.

Total aside: I actually really dislike that these products keep messing around with the system prompts so much. They clearly don't have a good way to tell how much a change will bias the results away from things other than whatever they're explicitly trying to correct. And why is the AI company vibe-prompting the behavior out when they could train it out and actually run it against evals?


and "quietly"!

Short terse sentences. Never use commas.

Paragraph break.

No foo. No bar. Only baz and qux. All writing is like a bad tech blog -- with language that mimics humanity. Yet is alien.

The smoking gun is extra wording. Typically simple language. Dense in tokens -- shallow in content. Repeating itself ad nauseam. Saying the same thing in different ways. Feeding back upon itself. Not adding content. Not adding depth. Only adding words.


“I’ve got the shape of it now”

This Mistral release really reminds you of the gap between the frontier labs and everyone else.

Pre-agent, there wasn't always an obvious difference between models. Various models had their charms. Nowadays, I don't want to entertain anything less than the frontier models. The difference in capability is enormous and choosing anything less has a real cost in terms of productivity.

I've been a big fan of the smaller labs like Mistral and especially Cohere but it's been a while since I've been excited by a release by either company.

That said, I'm using mistral voxtral realtime daily – it's great.


Can't agree at all. Productivity gap just 1 year ago was much larger for frontier model vs non-frontier. Let alone 2 years ago.

Same. The gap is almost paper-thin for anyone who hasn't gone full uninformed vibe coding.

When I was thinking pre-agentic, I was actually thinking more pre-"coding seen as the main use case for these models".

Coding has always been the main real-world business usecase since day one. There has been no point since the very first public availability of GPT 3.5 in November 2022, that it wasn't.

A lot of us have been agentic coding since almost 2 years ago, mid-2024. I have. The productivity gap of "best vs 2nd vs 3rd best model" was biggest back then and has slowly been shrinking ever since.


> Pre-agent, there wasn't always an obvious difference between models. Various models had their charms. Nowadays, I don't want to entertain anything less than the frontier models. The difference in capability is enormous and choosing anything less has a real cost in terms of productivity.

It's just apples to oranges.

There is not a clear, across the board, winner on non-agentic tasks between Gemini, ChatGPT, and Claude - the simple chatbot interface.

But Claude Code is substantially better than Codex which itself is notably better than Gemini-cli.

In this vein, it should not be surprising that Claude Code is way better than non-frontier models for agentic coding... It's substantially better than other frontier models at specialized agentic tasks.


I’ve been comparing Claude Code and Codex extensively side by side over the past couple of weeks with my favorite prompting framework superpowers…

From my perspective, Claude Code is decidedly not better than Codex. They’re slightly different and work better together. I would have no issues dropping CC entirely and using codex 100%.

If you’re working off of “defaults”, in other words no custom prompting, Claude Code does perform a lot better out of the box. I think this matters, but if you’re a professional software developer, I’d make the case that you should be owning your tools and moving beyond the baked in prompts.


I think there's a fair amount of evidence that the heavy harnesses actually drag down performance compared to bare harnesses.

CC is not better than Codex, nor is it better than OpenCode, Crush, Pi etc…

> Pre-agent, there wasn't always an obvious difference between models. Various models had their charms. Nowadays, I don't want to entertain anything less than the frontier models.

This is a very naive and misguided opinion. In most tasks, including complex coding tasks, you can hardly tell the difference between a frontier model and something like GPT-4.1. You need to really focus on areas such as context window, tool calling, and specific aspects of reasoning steps to start noticing differences. To make matters worse, frontier models take a brute-force approach to results, which ends up making them far more expensive to run, both in terms of what shows up on your invoice and how long you have to wait for any semblance of output.

And I won't even go into the topic of local models.


> You need to really focus on areas such as context window, tool calling and specific aspects of reasoning steps to start noticing differences.

This is like saying "the current models and the old models are the same if you ignore every important advance they've made"


Try a prompt like this: https://news.ycombinator.com/item?id=46809708#46821569 (first-person, exploratory prompting)

It has worked well for me. I have since put this into my AGENTS.md:

  I will check for the presence of `AGENTS.md` files in the workspace. When working in a subdirectory, I will check whether that subdirectory has its own `AGENTS.md` and follow its narrower instructions.

The press-watching side of me only has questions. Why was this published by Blender and not Anthropic? What does this actually mean? That the Blender team gets free Claude Code Max subscriptions?

What it means is here[1]. Anthropic is paying €240k a year and in return they get some marketing in the form of a press release and a website mention, as well as someone to talk to.

[1]: https://fund.blender.org/corporate-memberships/



The writing style is so refreshing. I am so tired of typical LLM prose. Despite people's recent attempts to hide it, it's all so obvious. When LLMs were primarily completion models, I thought that they would lead to more interesting writing, as people would prompt them to write aspirationally in styles that they enjoyed. I couldn't have been more wrong.

Indeed. I could plausibly be persuaded to put this model to use in the composition of the most mundane and least personal of business correspondence, a thing to which I am resolutely opposed with regard to the current generation of large language models.

GitHub had, by far, the most easily gameable agent usage policy. People would force the agent to run a script before the end of its turn that consisted entirely of `input("prompt: ")`, so you could essentially talk endlessly to an agent for the price of a single turn. I see this less as being about the future of this industry and more about fighting the costs incurred by bad actors.

I never played any games like that, but simply giving the agent a clear exit criteria and instructions to check the exit criteria every time it thinks it's done on a complex task was often enough to keep it chugging away for most of a day on a single prompt in my experience. Per-prompt pricing just isn't sustainable period, even if everyone is acting in good faith.

Charging by prompt was always wild to me.

I once asked it to do a comprehensive security review of our code. It churned for nearly an hour (and then produced 90% false positives). Insane that that usage was charged the same amount as me just saying "Hello".


If you do want a good reason to make fun of moleskine and not buy their products, it's because they're all extremely poorly made. I don't have a single moleskine where the pages haven't separated from the cover/spine after a few years.

I made pelicans at different thinking efforts:

https://hcker.news/pelican-low.svg

https://hcker.news/pelican-medium.svg

https://hcker.news/pelican-high.svg

https://hcker.news/pelican-xhigh.svg

Someone needs to make a pelican arena, I have no idea if these are considered good or not.


They are not good, and they seem to get worse as you increase effort. Weird

Yeah. I've always loosely correlated pelican quality with big model smell but I'm not picking that up here. I thought this was supposed to be spud? Weird indeed.

No but I can sense the movement, I think it's already reached the level of intelligence that draws it towards futurism or cubism /s

Can someone explain how we arrived at the pelican test? Was there some actual theory behind why it's difficult to produce? Or did someone just think it up, discover it was consistently difficult, and now we just all know it's a good test?

I set it up as a joke, to make fun of all of the other benchmarks. To my surprise it ended up being a surprisingly good measure of the quality of the model for other tasks (up to a certain point at least), though I've never seen a convincing argument as to why.

I gave a talk about it last year: https://simonwillison.net/2025/Jun/6/six-months-in-llms/

It should not be treated as a serious benchmark.


What it has going for it is human interpretability.

Anyone can look and decide if it’s a good picture or not. But the numeric benchmarks don’t tell you much if you aren’t already familiar with that benchmark and how it’s constructed.


how can you say "it ended up being a surprisingly good measure of the quality of the model for other tasks" and also "It should not be treated as a serious benchmark" in the same comment?

if it is indeed a good measure of the quality of the model (hint: it's not) then, logically, it should be taken seriously.

this is, sadly, a great example of the kind of doublethink the "AI" hypesters (yes - whether you like it or not simon - that is what you are now) are all too capable of.


I genuinely don't see how those two statements conflict with each other.

Despite not being a serious benchmark (how could it be serious? It's a pelican riding a bicycle!) it still turned out to have some value. You can see that just by scrolling through the archives and watching it improve as the models improved.

If your definition of doublethink is "holding two conflicting ideas in your head at once" then I would say doublethink is a necessary skill for navigating the weird AI era we find ourselves inhabiting.


"some value" is not the same as "a surprisingly good measure of the quality of the model for other tasks".

Doublethink does not mean holding two conflicting ideas in your head at once. It means holding two logically inconsistent positions/beliefs at the same time.


It all began with a Microsoft researcher showing a unicorn drawn in tikz using GPT4. It was an example of something so outrageous that there was no way it existed in the training data. And that's back when models were not multimodal.

Nowadays I think it's pretty silly, because there's surely SVG drawing training data and some effort from the researchers put onto this task. It's not a showcase of emergent properties.



It's interesting to see some semblance of spatial reasoning emerge from systems based on textual tokens. Could be seen as a potential proxy for other desirable traits.

It's meta-interesting that few if any models actually seem to be training on it. Same with other stereotypical challenges like the car-wash question, which is still sometimes failed by high-end models.

If I ran an AI lab, I'd take it as a personal affront if my model emitted a malformed pelican or advised walking to a car wash. Heads would roll.


I tried getting it to generate openscad models, which seems much harder. Not had much joy yet with results.

G code and ascii art are also text formats, but seem to be beyond most if not all models.

(There are some that generate 3d models specifically, more in the image generation family than chatbot family.)


None of them have the pelican's feet placed properly on the pedals -- or the pedals are misrepresented. Cool art style but not physically accurate.

I'm not sure a physically accurate pelican would reach two pedals on a common bicycle. Maybe a model can solve that problem one day.


Accessibility is a broad umbrella of features that enable a ton of really cool stuff for everybody, not just the disabled. Things like agentic computer use are only possible because of "accessibility".

The same accessibility stuff that makes screen readers work well also makes automated UI tests simpler and less brittle too (correct aria roles, accessible names, label relationships etc).

Accessibility is the only way we have access to any settings on the iPhone
