
I just tried to get Gemini to produce an image of a dog with 5 legs to test this out, and it really struggled with that. It either made a normal dog, or turned the tail into a weird appendage.

Then I asked both Gemini and Grok to count the legs, both kept saying 4.

Gemini just refused to consider it was actually wrong.

Grok seemed to have an existential crisis when I told it it was wrong, becoming convinced that I had given it an elaborate riddle. After thinking for an additional 2.5 minutes, it concluded: "Oh, I see now—upon closer inspection, this is that famous optical illusion photo of a "headless" dog. It's actually a three-legged dog (due to an amputation), with its head turned all the way back to lick its side, which creates the bizarre perspective making it look decapitated at first glance. So, you're right; the dog has 3 legs."

You're right, this is a good test. Just when I was starting to feel LLMs are intelligent.


Draw a millipede as a dog:

Gemini responds:

Conceptualizing the "Millipup"

https://gemini.google.com/share/b6b8c11bd32f

Draw the five legs of a dog as if the body is a pentagon

https://gemini.google.com/share/d74d9f5b4fa4

And animal legs are quite standardized

https://en.wikipedia.org/wiki/List_of_animals_by_number_of_l...

It's all about the prompt. Example:

Can you imagine a dog with five legs?

https://gemini.google.com/share/2dab67661d0e

And generally, the issue sits between the computer and the chair.

;-)


An interesting test in this vein that I read about in a comment on here is generating a 13 hour clock—I tried just about every prompting trick and clever strategy I could come up with across many image models with no success. I think there's so much training data of 12 hour clocks that just clobbers the instructions entirely. It'll make a regular clock that skips from 11 to 13, or a regular clock with a plaque saying "13 hour clock" underneath, but I haven't gotten an actual 13 hour clock yet.

Right you are. It can do 26 hours just fine, but appears completely incapable when the layout would be too close to a normal clock.

https://gemini.google.com/share/b3b68deaa6e6

I thought giving it a setting would help, but just skip that first response to see what I mean.


"just fine" is not really an accurate description of that 26-hour clock

That's a 24 hour clock that skips some numbers and puts other numbers out of order.

It was ugly. But I got ChatGPT to cheat and do it

https://chatgpt.com/share/6933c848-a254-8010-adb5-8f736bdc70...

This is the SVG it created.

https://imgur.com/a/LLpw8YK
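For anyone curious, the "cheat" amounts to generating the clock face as markup instead of pixels. A rough Python sketch of the idea (this is not the actual SVG from the link above, and the sizes and styling are arbitrary):

  # Sketch: build an N-hour clock face as SVG directly, sidestepping the
  # image model entirely. Layout choices are arbitrary.
  import math

  def clock_svg(hours=13, r=180, cx=200, cy=200):
      marks = []
      for h in range(1, hours + 1):
          # put the top-most position at h == hours, like "12" on a normal clock
          angle = 2 * math.pi * h / hours - math.pi / 2
          x = cx + 0.85 * r * math.cos(angle)
          y = cy + 0.85 * r * math.sin(angle)
          marks.append(f'<text x="{x:.0f}" y="{y:.0f}" text-anchor="middle" '
                       f'dominant-baseline="middle" font-size="24">{h}</text>')
      face = f'<circle cx="{cx}" cy="{cy}" r="{r}" fill="none" stroke="black" stroke-width="4"/>'
      return (f'<svg xmlns="http://www.w3.org/2000/svg" width="400" height="400">'
              + face + "".join(marks) + "</svg>")

  with open("clock13.svg", "w") as f:
      f.write(clock_svg(13))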


If you want to see something rather amusing - instead of using the LLM aspect of Gemini 3.0 Pro, feed a five-legged dog directly into Nano Banana Pro and give it an editing task that requires an intrinsic understanding of the unusual anatomy.

  Place sneakers on all of its legs.

It'll get this correct a surprising number of times (tested with BFL Flux2 Pro and NB Pro).

https://imgur.com/a/wXQskhL


Does this still work if you give it a pre-existing many-legged animal image, instead of first prompting it to add an extra leg and then prompting it to put the sneakers on all the legs?

I'm wondering if it may only expect the additional leg because you literally just told it to add said additional leg. It would just need to remember your previous instruction and its previous action, rather than to correctly identify the number of legs directly from the image.

I'll also note that photos of dogs with shoes on is definitely something it has been trained on, albeit presumably more often dog booties than human sneakers.

Can you make it place the sneakers incorrectly-on-purpose? "Place the sneakers on all the dog's knees?"


My example was unclear. Each of those images on Imgur was generated using independent API calls which means there was no "rolling context/memory".

In other words:

1. Took a personal image of my dog Lily

2. Had NB Pro add a fifth leg using the Gemini API

3. Downloaded image

4. Sent image to BFL Flux2 Pro via the BFL API with the prompt "Place sneakers on all the legs of this animal".

5. Sent image to NB Pro via Gemini API with the prompt "Place sneakers on all the legs of this animal".

So not only was there zero "continual context", it was two entirely different models as well to cover my bases.
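A rough sketch of what one of those stateless calls looks like (step 5 above), assuming the google-genai Python SDK; the model id and filenames are placeholders, not the exact values used:

  # Sketch of a single stateless editing call (step 5), assuming the
  # google-genai Python SDK; model id and filenames are placeholders.
  from io import BytesIO
  from PIL import Image
  from google import genai

  client = genai.Client()  # picks up GEMINI_API_KEY from the environment
  source = Image.open("lily_five_legs.png")

  response = client.models.generate_content(
      model="gemini-3-pro-image-preview",  # "NB Pro"; exact id may differ
      contents=[source, "Place sneakers on all the legs of this animal"],
  )

  # Save whatever image parts come back.
  for part in response.candidates[0].content.parts:
      if part.inline_data is not None:
          Image.open(BytesIO(part.inline_data.data)).save("lily_sneakers.png")

The BFL Flux2 call goes through BFL's own REST API but follows the same stateless pattern: one image in, one prompt, one image out, no shared context.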

EDIT: Added images to the Imgur for the following prompts:

- Place red Dixie solo cups on the ends of every foot on the animal

- Draw a red circle around all the feet on the animal


I imagine the real answer is that the edits are local, because that's how diffusion works; it's not like it's turning the input into "five-legged dog" and then generating a five-legged dog in shoes from scratch.

I had no trouble getting it to generate an image of a five-legged dog first try, but I really was surprised at how badly it failed in telling me the number of legs when I asked it in a new context, showing it that image. It wrote a long defense of its reasoning and when pressed, made up demonstrably false excuses of why it might be getting the wrong answer while still maintaining the wrong answer.

Yeah it gave me the 5-legged dog on the 4th or 5th try.

It's not that they aren't intelligent; it's that they have been RL'd like crazy not to do that.

It's rather like how, as humans, we are RL'd like crazy to be grossed out if we view a picture of a handsome man and a beautiful woman kissing (after we are told they are brother and sister).

I.e. we all have trained biases that we are told to follow and are trained on; human art is about subverting those expectations.


Why should I assume that a failure which looks like the model doing fairly simple pattern matching ("this is a dog, dogs don't have 5 legs, anything else is irrelevant") rather than more sophisticated feature counting on a concrete instance of an entity is down to RL, rather than just a prediction failure caused by training data that contains no 5-legged dogs and an inability to go out of distribution?

RL has been used extensively in other areas - such as coding - to improve model behavior on out-of-distribution stuff, so I'm somewhat skeptical of handwaving away a critique of a model's sophistication by saying here it's RL's fault that it isn't doing well out-of-distribution.

If we don't start from a position of anthropomorphizing the model into a "reasoning" entity (and instead have our prior be "it is a black box that has been extensively trained to try to mimic logical reasoning") then the result seems to be "here is a case where it can't mimic reasoning well", which seems like a very realistic conclusion.


I have the same problem: people are trying so hard to come up with reasoning behind it when there's just nothing like that there. It was trained on data and it finds what it was trained to find; if you go outside the training data it gets lost, and we should expect it to get lost.

I’m inclined to buy the RL story, since the image gen “deep dream” models of ~10 years ago would produce dogs with TRILLIONS of eyes: https://doorofperception.com/2015/10/google-deep-dream-incep...

That's apples to oranges; your link says they made it exaggerate features on purpose.

"The researchers feed a picture into the artificial neural network, asking it to recognise a feature of it, and modify the picture to emphasise the feature it recognises. That modified picture is then fed back into the network, which is again tasked to recognise features and emphasise them, and so on. Eventually, the feedback loop modifies the picture beyond all recognition."

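The loop it describes is roughly gradient ascent on a layer's activations. A minimal sketch, assuming torch/torchvision are available (the layer choice, step size, and iteration count are arbitrary):

  # Minimal sketch of the described feedback loop: recognise features,
  # nudge the image to emphasise them, feed it back in, repeat.
  import torch
  from torchvision import models

  model = models.googlenet(weights="DEFAULT").eval()
  acts = []
  model.inception4c.register_forward_hook(lambda mod, inp, out: acts.append(out))

  image = torch.rand(1, 3, 224, 224, requires_grad=True)
  for _ in range(20):
      acts.clear()
      model(image)
      loss = acts[0].norm()  # "emphasise the feature it recognises"
      loss.backward()
      with torch.no_grad():
          image += 0.05 * image.grad / (image.grad.abs().mean() + 1e-8)
          image.grad.zero_()
          image.clamp_(0, 1)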

"There are four lights"

And the AI has been RL'd for tens of thousands of years, not just a few days.


My guess is the part of its neural network that parses the image into a higher level internal representation really is seeing the dog as having four legs, and intelligence and reasoning in the rest of the network isn't going to undo that. It's like asking people whether "the dress" is blue/black or white/gold: people will just insist on what they see, even if what they're seeing is wrong.

Isn't this proof that LLMs still don't really generalize beyond their training data?

LLMs are very good at generalizing beyond their training (or context) data. Normally when they do this we call it hallucination.

Only now we do A LOT of reinforcement learning afterwards to severely punish this behavior for subjective eternities. Then act surprised when the resulting models are hesitant to venture outside their training data.


Hallucinations are not generalization beyond the training data but interpolations gone wrong.

LLMs are in fact good at generalizing beyond their training set; if they didn't generalize at all we would call that over-fitting, and that is not good either. What we are talking about here is simply a bias, and I suspect biases like these are simply a limitation of the technology. Some of them we can get rid of, but—like almost all statistical modelling—some biases will always remain.


What, may I ask, is the difference between "generalization" and "interpolation"? As far as I can tell, the two are exactly the same thing.

In which case the only way I can read your point is that hallucinations are specifically incorrect generalizations. In which case, sure if that's how you want to define it. I don't think it's a very useful definition though, nor one that is universally agreed upon.

I would say a hallucination is any inference that goes beyond the compressed training data represented in the model weights + context. Sometimes these inferences are correct, and yes we don't usually call that hallucination. But from a technical perspective they are the same -- the only difference is the external validity of the inference, which may or may not be knowable.

Biases in the training data are a very important, but unrelated issue.


Interpolation and generalization are two completely different constructs. Interpolation is when you have two data points and make a best guess where a hypothetical third point should fit between them. Generalization is when you have a distribution which describes a particular sample, and you apply it with some transformation (e.g. a margin of error, a confidence interval, p-value, etc.) to a population the sample is representative of.

Interpolation is a much narrower construct than generalization. LLMs are fundamentally much closer to curve fitting (where interpolation is king) than they are to hypothesis testing (where samples are used to describe populations), though they certainly do something akin to the latter too.
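A toy numerical illustration of the two constructs (this is about the statistics vocabulary, not a claim about LLM internals):

  import numpy as np

  rng = np.random.default_rng(0)

  # Interpolation: estimate a value *between* known points on a fitted curve.
  x = np.array([0.0, 1.0, 2.0])
  y = np.array([0.0, 1.0, 4.0])           # samples of y = x**2
  print(np.interp(1.5, x, y))             # linear guess between (1, 1) and (2, 4) -> 2.5

  # Generalization: describe a sample, then make a claim about the population
  # it was drawn from, with an explicit margin of error.
  sample = rng.normal(loc=10.0, scale=2.0, size=100)
  mean = sample.mean()
  sem = sample.std(ddof=1) / np.sqrt(len(sample))
  print(f"population mean ~= {mean:.2f} +/- {1.96 * sem:.2f} (95% CI)")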

The bias I am talking about is not a bias in the training data, but bias in the curve fitting, probably because of mal-adjusted weights, parameters, etc. And since there are billions of them, I am very skeptical they can all be adjusted correctly.


I assumed you were speaking by analogy, as LLMs do not work by interpolation, or anything resembling that. Diffusion models, maybe you can make that argument. But GPT-derived inference is fundamentally different. It works via model building and next token prediction, which is not interpolative.

As for bias, I don’t see the distinction you are making. Biases in the training data produce biases in the weights. That’s where the biases come from: over-fitting (or sometimes, correct fitting) of the training data. You don’t end up with biases at random.


> It works via model building and next token prediction, which is not interpolative.

I'm not particularly well-versed in LLMs, but isn't there a step in there somewhere (latent space?) where you effectively interpolate in some high-dimensional space?


What I meant was that what LLMs are doing is very similar to curve fitting, so I think it is not wrong to call it interpolation (curve fitting is a type of interpolation, but not all interpolation is curve fitting).

As for bias, sampling bias is only one of many types of bias. I mean, the UNIX program yes(1) has a bias towards outputting the string "y" despite not sampling any data. You can very easily and deliberately program a bias into anything you like. I am writing a kanji learning program using SSR and I deliberately bias new cards towards the end of the review queue to help users with long review queues empty them quicker. There is no data which causes that bias; it is just programmed in there.
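A hypothetical sketch of what I mean (made-up names, not the actual program):

  # Hypothetical sketch of a deliberately programmed, data-free bias:
  # new cards sink to the back so long review queues drain faster.
  def order_queue(cards):
      # stable sort: reviews keep their relative order, new cards go last
      return sorted(cards, key=lambda card: card["is_new"])

  queue = [
      {"id": 1, "is_new": True},
      {"id": 2, "is_new": False},
      {"id": 3, "is_new": True},
      {"id": 4, "is_new": False},
  ]
  print([c["id"] for c in order_queue(queue)])  # [2, 4, 1, 3]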

I don't know enough about diffusion models to know how biases can arise, but with unsupervised learning (even though sampling bias is indeed very common) you can get a bias because you are using the wrong parameters, mal-adjusted parameters, too many parameters, etc. Even the way your data interacts during training can cause a bias; heck, even by chance one of your parameters can hit an unfortunate local maximum, yielding a mal-adjusted weight, which may cause bias in your output.


I wonder how they would behave given a system prompt that asserts "dogs may have more or less than four legs".
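For anyone who wants to try it: with the google-genai Python SDK a system instruction can be passed roughly like this (a sketch; the model id and filename are illustrative):

  # Rough sketch, assuming the google-genai Python SDK; model id is illustrative.
  from PIL import Image
  from google import genai
  from google.genai import types

  client = genai.Client()
  response = client.models.generate_content(
      model="gemini-3-pro-preview",  # placeholder id
      contents=[Image.open("five_legged_dog.png"),
                "How many legs does this dog have?"],
      config=types.GenerateContentConfig(
          system_instruction="Dogs in these images may have more or fewer than four legs."
      ),
  )
  print(response.text)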

That may work but what actual use would it be? You would be plugging one of a million holes. A general solution is needed.

They do, but we call it "hallucination" when that happens.

Kind of feels that way

> starting to feel LLMs are intelligent

LLMs are fancy “lorem ipsum based on a keyword” text generators. They can never become intelligent … or learn how to count or do math without the help of tools.

It can probably generate a story about a 5 legged dog though.


I feel a weird mix of extreme amusement and anger that there's a fleet of absurdly powerful, power-hungry servers sitting somewhere being used to process this problem for 2.5 minutes

Try a 7-legged dog. Game over.

LLMs are getting a lot better at understanding our world by its standard rules. As they do, maybe they lose something in the way of interpreting non-standard rules, a.k.a. creativity.

It's not obvious to me whether we should count these errors as failures of intelligence or failures of perception. There's at least a loose analogy to optical illusion, which can fool humans quite consistently. Now you might say that a human can usually figure out what's going on and correctly identify the illusion, but we have the luxury of moving our eyes around the image and taking it in over time, while the model's perception is limited to a fixed set of unchanging tokens. Maybe this is relevant.

(Note I'm not saying that you can't find examples of failures of intelligence. I'm just questioning whether this specific test is an example of one).


I am having trouble understanding the distinction you’re trying to make here. The computer has the same pixel information that humans do and can spend its time analyzing it in any way it wants. My four-year-old can count the legs of the dog (and then say “that’s silly!”), whereas LLMs have an existential crisis because five-legged-dogs aren’t sufficiently represented in the training data. I guess you can call that perception if you want, but I’m comfortable saying that my kid is smarter than LLMs when it comes to this specific exercise.

Your kid, it should be noted, has a massively bigger brain than the LLM. I think the surprising thing here maybe isn't that the vision models don't work well in corner cases but that they work at all.

Also my bet would be that video capable models are better at this.



