> Understanding what kind of tasks LLMs can and cannot reliably solve remains incredibly difficult and unintuitive.
Case in point: the other day my daughter was doing a presentation and she said "Dad can you help me find a picture of the word HELLO spelled out in vegetables?"
I was like "CAN I!!?!?! This sounds like a job for ChatGPT".
I'll tell you what: ChatGPT can give you a picture of a cat wearing a space suit drinking a martini but it definitely cannot give you the word HELLO spelled out in vegetables.
I ended up getting it to give me each individual letter of the alphabet constructed with vegetables and she pasted them together to make the words she wanted for her presentation.
It's always funny to read these stories if you know how ChatGPT actually works, because if you know about tokenization, you know why this is definitely not a good job for ChatGPT. It's exactly the same reason it can't spell STRAWBERRY. Not because it doesn't understand the concept of fruits or vegetables, or because it can't handle sophisticated concepts like metaphors or memes. It's not a good job for it because it doesn't see text the way you see it. You see the word "hello" made up of individual characters, but the model sees it as a single token (the token with id 24912 for gpt-4o, to be precise). It knows the meaning of this token and its relationship to other tokens much the same way you know the relations between words. But it has fundamentally no clue about the characters that make up this word (unless someone trained it to do so, or it picks up spurious additional relations that happen to exist in the training data).
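To make the mismatch concrete: character-level operations that are trivial when you can actually see the characters are exactly the ones a token-based model never observes directly. A quick sketch in plain Python (the token id shown is just the value quoted above, not something this snippet verifies):

```python
# Character view: answering "how many r's?" or "spell it out" is
# trivial by direct inspection of the string.
word = "strawberry"
print(word.count("r"))         # 3
print(" ".join(word.upper()))  # S T R A W B E R R Y

# Token view (schematic): the model receives only opaque ids, e.g.
# "hello" -> [24912] under gpt-4o's tokenizer (per the comment above).
# The characters inside the token are simply not part of its input.
token_view = [24912]
```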
> but the model sees it as a single token (the token with id 24912 for gpt-4o to be precise). It knows the meaning of this token and it's relationship to other tokens much the same way you know relations between words
In this context, if we assume that Deep Thought from Hitchhiker's Guide is an LLM, then the answer to everything[1] i.e. 42 makes sense. 42 is just the token id !
> But it has fundamentally no clue about the characters that make up this word (unless someone trained it to do so or by using spurious additional relations that might exist in the training data).
That was my theory as well when I first saw the strawberry test. However, it is easy to test whether they know how to spell.
The most obvious is:
> Can you spell "It is wonderful weather outside. I should go out and play.". Use capital letters, and separate each letter with a space.
The free-tier ChatGPT model is smart enough to understand the following instructions as well, which shows it's not just simple words:
> I was wondering if you can spell. When I ask you a question, answer me in capital letters, and separate each letter with a space. Where there is a real space between the words, insert the characters '--' there, so the output is easier to read. Tell me how the attention mechanism works in modern transformer language models.
Also, somebody pointed out in another HN thread that modern LLMs are great for dyslexic people, because you can typo every single word and the model still understands you perfectly. Not sure how true this is, but at least a simple example seems to work:
> Hlelo, how aer you diong. Cna you undrestnad me?
It would be interesting to know whether the datasets actually include spelling examples, or whether the models learn how to spell from the massive number of spelling mistakes in the datasets.
They can do this kind of thing, but in my experience, that makes the model feel "dumber" as far as quality of output goes (unless you make it produce normal output first before having it convert it to something else).
I wonder if there's research being done on training LLMs with extended data in analogy to the "kernel trick" for SVMs: the same way one might feed (x, x^2, x^3) rather than just x, and thus make a linear model able to reason about a nonlinear boundary, should we be feeding multimodal LLMs with not only a token-by-token but also character-by-character and pixel-by-pixel representation of prompt texts during training and inference? Or, allow them to "circle back" and request they be given that as subsequent input text, if they detect that it's relevant information for them? There's likely a lot of interesting work here.
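The feature-expansion idea above can be sketched in a few lines of plain Python (a toy illustration of the analogy, not a training proposal): a linear decision rule on raw x cannot separate the nonlinear concept "|x| > 2", but the same linear rule applied to the expanded feature x² can.

```python
# Toy data: label is 1 when |x| > 2 -- a concept no single threshold
# on raw x can capture, because the positives sit on both ends.
xs = [-4, -3, -1, 0, 1, 3, 4]
ys = [1 if abs(x) > 2 else 0 for x in xs]

def threshold_accuracy(features, labels):
    """Best accuracy achievable by any single rule of the form f > t
    (or its flipped version) on the given 1-D features."""
    best = 0.0
    candidates = sorted(set(features))
    for t in candidates + [min(candidates) - 1]:
        preds = [1 if f > t else 0 for f in features]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        best = max(best, acc, 1 - acc)  # allow the flipped rule too
    return best

print(threshold_accuracy(xs, ys))                   # < 1.0: raw x fails
print(threshold_accuracy([x * x for x in xs], ys))  # 1.0: x^2 succeeds
```

The same linear machinery becomes strictly more capable once the input representation is enriched, which is the analogy being drawn for feeding character- or pixel-level views alongside tokens.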
You can train the model to do these things in the same way you can train a blind person to describe the colors of objects. But if you put them in an unknown environment or give them a texture they've never encountered before, they will have no idea how to perceive its color. This is a fundamental problem for LLMs and won't change until someone invents a method that gets rid of tokenization for good.
Wow, it used very interesting vegetables: the very common E-shaped cucumber, and of course the commonplace O-shaped tomato with a void in the middle. Very usual vegetables; you can buy them in any grocery store.
Sometimes I wonder if anything can actually impress anyone on this site.
Words have plagued image gen since the start. Now there is an image model that, with an extremely simple prompt, does an awesome job with words.
If they expanded their prompt and played with a few seeds until an image with perfectly realistic vegetables was generated, I wonder what the next complaint would be.
When I see the prompt "the word HELLO spelled out in vegetables", I expect realistic vegetables being assembled into the appropriate shapes, such that e.g. the O is made of many different vegetables arranged in a circle.
I can’t see the pic in the shared link, but when I tried with 4o it gave me HIIILO.
EDIT: maybe it’s influenced by my custom instructions and memories. I write code all day with it and I have custom instructions specifically to get the type of output I like for code, mostly focused on brevity.