
> Understanding what kind of tasks LLMs can and cannot reliably solve remains incredibly difficult and unintuitive.

Case in point: the other day my daughter was doing a presentation and she said "Dad can you help me find a picture of the word HELLO spelled out in vegetables?"

I was like "CAN I!!?!?! This sounds like a job for ChatGPT".

I'll tell you what: ChatGPT can give you a picture of a cat wearing a space suit drinking a martini but it definitely cannot give you the word HELLO spelled out in vegetables.

I ended up getting it to give me each individual letter of the alphabet constructed with vegetables and she pasted them together to make the words she wanted for her presentation.



It's always funny to read these stories if you know how ChatGPT actually works. Because if you know about tokenization, you know why this is definitely not a good job for ChatGPT. Exactly the same reason why it can't spell STRAWBERRY. Not because it doesn't understand the concept of fruits or vegetables, or because it doesn't understand sophisticated concepts like metaphors or memes. It's not a good job for it because it doesn't see text the way you see it. You see the word "hello" made up of individual characters, but the model sees it as a single token (the token with id 24912 for gpt-4o, to be precise). It knows the meaning of this token and its relationship to other tokens much the same way you know relations between words. But it has fundamentally no clue about the characters that make up this word (unless someone trained it to do so, or it picks up spurious additional relations that might exist in the training data).
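A toy sketch of the mismatch (the vocabulary and ids below are made up, not the real gpt-4o vocabulary): the model's input is a list of opaque integer ids, so a question like "how many L's are in hello" refers to characters that are simply not present in what it receives.

```python
# Toy illustration with made-up token ids (NOT the real GPT-4o vocabulary).
# After tokenization the characters are gone; only opaque integers remain.
toy_vocab = {"hello": 24912, " spelled": 101, " out": 102}

def toy_tokenize(text, vocab):
    """Greedy longest-match tokenization over a tiny hand-made vocabulary."""
    tokens, i = [], 0
    while i < len(text):
        for piece in sorted(vocab, key=len, reverse=True):
            if text.startswith(piece, i):
                tokens.append(vocab[piece])
                i += len(piece)
                break
        else:
            raise ValueError("out-of-vocabulary text")
    return tokens

print(toy_tokenize("hello spelled out", toy_vocab))  # [24912, 101, 102]
```

Nothing in `[24912, 101, 102]` tells the model that the first token contains two L's; that fact has to come from training data, not from the input itself.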


> but the model sees it as a single token (the token with id 24912 for gpt-4o, to be precise). It knows the meaning of this token and its relationship to other tokens much the same way you know relations between words

In this context, if we assume that Deep Thought from Hitchhiker's Guide is an LLM, then the answer to everything[1], i.e. 42, makes sense: 42 is just the token id!

1. https://en.m.wikipedia.org/wiki/Phrases_from_The_Hitchhiker%...


Oh my god, you cracked it. After all these years...


> But it has fundamentally no clue about the characters that make up this word (unless someone trained it to do so or by using spurious additional relations that might exist in the training data).

That was my theory as well when I first saw the strawberry test. However, it is easy to test whether they know how to spell.

The most obvious is:

> Can you spell "It is wonderful weather outside. I should go out and play.". Use capital letters, and separate each letter with a space.

The free-tier ChatGPT model is smart enough to understand the following instructions as well, which shows that it's not just handling simple words:

> I was wondering if you can spell. When I ask you a question, answer me with capital letters, and separate each word with a space. When there is real space between the letters, insert character '--' there, so the output is easier to read. Tell me how the attention mechanism works in the modern transformer language models.

Also somebody pointed out in some other HN thread that the modern LLMs are perfect for dyslexic people, because you can typo every single word and the model still understands you perfectly. Not sure how true this is, but at least a simple example seems to work:

> Hlelo, how aer you diong. Cna you undrestnad me?

It would be interesting to know if the datasets actually include spelling examples, or if the models learn how to spell from the massive amount of spelling mistakes in the datasets.


They can do this kind of thing, but in my experience, that makes the model feel "dumber" as far as quality of output goes (unless you make it produce normal output first before having it convert it to something else).


I wonder if there's research being done on training LLMs with extended data in analogy to the "kernel trick" for SVMs: the same way one might feed (x, x^2, x^3) rather than just x, and thus make a linear model able to reason about a nonlinear boundary, should we be feeding multimodal LLMs with not only a token-by-token but also character-by-character and pixel-by-pixel representation of prompt texts during training and inference? Or, allow them to "circle back" and request they be given that as subsequent input text, if they detect that it's relevant information for them? There's likely a lot of interesting work here.
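A minimal sketch of the feature-map half of that analogy (weights picked by hand, purely illustrative): a 1-D labeling rule that no linear threshold on x alone can capture becomes linearly separable once x² is appended as an extra input feature, which is the explicit-feature-map view of the kernel trick.

```python
# Label = 1 when x**2 > 1, i.e. the positive region is split in two
# (x < -1 and x > 1), so no single threshold on x separates the classes.
# Appending x**2 as a second feature makes a linear rule sufficient.
def phi(x):
    """Explicit feature map: lift x into (x, x**2)."""
    return (x, x * x)

# Linear decision rule w . phi(x) + b > 0, with hand-chosen weights:
# w = (0, 1), b = -1 fires exactly when x**2 > 1.
w, b = (0.0, 1.0), -1.0

def predict(x):
    fx = phi(x)
    return 1 if w[0] * fx[0] + w[1] * fx[1] + b > 0 else 0

print([predict(x) for x in (-2, -0.5, 0.5, 2)])  # [1, 0, 0, 1]
```

The analogous question for LLMs would be whether appending a character-level (or pixel-level) view of the prompt alongside the token ids gives the model a representation in which spelling tasks become "linearly easy" instead of requiring memorized token-to-spelling facts.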


> if you know about tokenization, you know why this is definitely not a good job for ChatGPT

Why are other LLMs able to do it? (Other comments show images successfully generated with grok and flux.1)


You can train the model to do these things in the same way you can train a blind person to describe the colors of objects. But if you put them in an unknown environment or give them a texture they've never encountered before, they will have no idea how to perceive its color. This is a fundamental problem for LLMs and won't change until someone invents a method that gets rid of tokenization for good.


This reminds me of how the alien brain species in Futurama went about gathering all the facts in the Universe.

"Beavers mate for life, 11 > 4"


> the token with id 24912 for gpt-4o to be precise

How do you find this out?


GPT-4o uses BPE (byte pair encoding). OpenAI released `tiktoken`, which lets you tokenize strings in Python:

    $ pip install tiktoken
    $ python
    >>> import tiktoken
    >>> encoding = tiktoken.encoding_for_model("gpt-4o")
    >>> encoding.encode("hello marcellus")
    [24912, 2674, 10936, 385]


OpenAI provides a tokenizer tool: https://platform.openai.com/tokenizer


To hazard a silly question...

Why can't gpt pick up a non-fundamental "understanding" of letters and spelling from the data?

I mean... I don't "see letters" either when speaking/hearing, but I do know how to pull up those letters when necessary.


You can do that with flux.1; it's the best image model right now as far as I'm aware, especially for dealing with text.

This is the result for "vegetables spelling out the word 'HELLO'". I used flux-pro on Replicate: https://ibb.co/1RVKmdk


Wow, it used very interesting vegetables: the very common E-shaped cucumber and, of course, the commonplace O-shaped tomato with a void in the middle. Very usual vegetables; you can buy them in any grocery store.


Sometimes I wonder if anything can actually impress anyone on this site.

Words have plagued image gen since the start. Now there is an image model that, with an extremely simple prompt, does an awesome job with words.

If they expanded their prompt and played with a few seeds until an image with perfectly realistic vegetables were generated, I wonder what the next complaint would be.


People are threatened. They are not going to celebrate the thing that makes them irrelevant. They are going to talk shit about it and downplay it.


I'm not sure "threatened" is correct; I think it's more a collective "why?", and there hasn't been a particularly convincing answer.


Right, tons of gen AI stuff is super impressive. It's really cool that we can do all this stuff. In this thread we're talking about a fun toy.

But the actual practical applications are like, small useful tools. There's no real sign we're heading for a world-changing trillion dollar industry.


When I see the prompt "the word HELLO spelled out in vegetables", I expect realistic vegetables being assembled into the appropriate shapes, such that e.g. the O is made of many different vegetables arranged in a circle.

I don't expect imaginary nightmare vegetables.


Nice! I haven't had a go at flux yet but I'll keep that in my back pocket for the next time I need to spell a word using vegetables.


It got the letters right but none of the vegetables it used exist on this planet.



Yeah, and notice the emphasis on the word "unusual".



Grok did OK? (Grok 2 Mini Beta)

https://i.imgur.com/mvnusFd.jpeg


What kind of vegetable is the first 'l' made out of? It's like tiny chicken nuggets.


Expired cauliflower.


Looks a bit like cloudberries but not really. More like fried chicken


Grok just uses Flux, fyi


Not bad! I just tried with 4o and ChatGPT is still failing hard.

Have I just inadvertently invented a new benchmark!?


No. Had you asked for it to spell "strawberry", though...


> Not bad!

Well, it's not a great example of spelling "HELLO"...


Hey we would have been happy with a case insensitive but otherwise correct answer ...


I just tried it with ChatGPT 4o and it seemed to do a good enough job with the first prompt I tried (which I copied from your comment).

https://chatgpt.com/share/66f530b0-3fb8-800a-8af9-8a3e48a31a...


I can’t see the pic in the shared link, but when I tried with 4o it gave me HIIILO.

EDIT: maybe it’s influenced by my custom instructions and memories. I write code all day with it and I have custom instructions specifically to get the type of output I like for code, mostly focused on brevity.


Funny enough, Grok was able to generate one: https://i.imgur.com/qt89arC.png


I think they're trying to spell "HELP".


At one point, asking ChatGPT to output SVG spelling out "HELLO" would pretty consistently produce something like "LOL".



