The most interesting thing about diffusion LMs, and one that tends to be missed, is their ability to edit early tokens.
We know that the early tokens in an autoregressive sequence disproportionately bias the outcome. I would go as far as to say that some of the magic of reasoning models is that they generate so much text they can kind of get around this.
However, diffusion seems like a much better way to solve this problem.
The model is trained to encourage re-evaluating the soundness of tokens produced during the "thinking phase".
The model's state is kept in a mode of open exploration: influenced by the already emitted tokens, but less strongly so.
The non-reasoning models were just trained with the goal of producing useful output on the first try, and they did their best to maximize that fitness function.
But how can test-time compute be implemented for diffusion models if they already operate on the whole text at once? Say it gets stuck—how does it proceed further? Autoregressive reasoning models would simply backtrack and try other approaches. It feels like denoising the whole text further wouldn't lead to good results, but I may be wrong.
Diffusion LLMs are still residual networks. You can Google that, but it means they don't produce the final text in one shot: every layer generates corrections that are applied to the whole text at once.
Think of it like writing a text by forcing your teacher to write it for you by handing in the assignment 100 times. You begin by generating completely inaccurate, almost random text that leans perhaps a little bit towards the answer. Then you systematically begin to correct small parts of it. The teacher sees the text and uses the red pen to correct a bunch of things. Then the corrected text is copied onto a fresh page and resubmitted to the teacher. And again. And again. And again. And again. 50 times. 100 times. That's how diffusion models work.
Technically, it adds your corrections to the text, but that's mathematical addition, not appending at the end. Also technically, every layer is a teacher that's slightly different from the previous one. And and and ... but this is the basic principle. The big advantage is that this lets the network slowly lean towards the answer. First it decides to have three sections, one about X, one about Y and one about Z, then it decides what sentences to put where, then it starts thinking about individual words, then it starts worrying about things like grammar, and finally about spelling and pronouns and so on.
So to answer your question: diffusion networks can at any time decide to send out a correction that effectively erases the text (in several ways). So they can always start over by just correcting everything all at once back to randomness.
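To make the loop concrete, here's a tiny runnable sketch in PyTorch. The ToyDenoiser is made up and is not any real diffusion LM's architecture; the only point is the shape of the loop described above (start from noise, apply residual corrections to the whole draft, many times):

    import torch
    import torch.nn as nn

    class ToyDenoiser(nn.Module):
        # Stand-in for a diffusion LM: one linear layer that proposes a
        # correction for every position of the draft at once (purely illustrative).
        def __init__(self, d=16, length=8):
            super().__init__()
            self.proj = nn.Linear(d + 1, d)  # +1 input feature carries the step number
            self.d, self.length = d, length

        def forward(self, x, t):
            t_col = torch.full((x.shape[0], 1), float(t))
            return self.proj(torch.cat([x, t_col], dim=-1))

    model = ToyDenoiser()
    x = torch.randn(model.length, model.d)  # the "almost random" first draft
    for t in reversed(range(100)):          # 100 resubmissions to the "teacher"
        x = x + 0.1 * model(x, t)           # residual correction of the whole draft at once
    # A real diffusion LM would now decode x into tokens. Nothing stops a step
    # from pushing part of the draft back toward noise, which is the
    # "erase and start over" escape hatch described above.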
Yeah, but with autoregressive models, the state grows, whereas with diffusion models, it remains fixed. As a result, a diffusion model can't access its past thoughts (e.g., thoughts that rejected certain dead ends) and may start oscillating between the same subpar results if you keep denoising multiple times.
You try to define that it does. Here's my attempt: reasoning is nothing but a set of rules to follow. Complex reasoning is a set of rules that includes instructions on how to extend the set of rules in some way.
Image diffusion does both (if you include things like training and finetuning)
Bidirectional seq2seq models are usually more accurate than unidirectional models.
However, autoregressive models that generate one token at a time are usually more accurate than parallel models that generate multiple tokens at a time.
In diffusion LLMs, both of these effects interact. You can trade them off by choosing how many tokens are generated at a time, and how many future tokens are used to predict the next set of tokens.
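A sketch of that knob (everything here is made up for illustration, assuming a masked-diffusion-style model whose predictor sees the whole sequence and fills in masked positions):

    MASK = "<mask>"

    def block_decode(predict_fn, prompt, gen_len=16, block_size=4):
        # Semi-autoregressive decoding sketch: block_size=1 behaves like an
        # autoregressive model (one committed token per call), while
        # block_size=gen_len is fully parallel. predict_fn always sees the
        # whole sequence, so committed tokens act as bidirectional context.
        seq = list(prompt) + [MASK] * gen_len
        for start in range(len(prompt), len(seq), block_size):
            filled = predict_fn(seq)  # predicts every masked position at once
            seq[start:start + block_size] = filled[start:start + block_size]
        return seq

    # Dummy predictor just so the sketch runs end to end:
    print(block_decode(lambda s: [w if w != MASK else "tok" for w in s],
                       prompt=["The", "answer", "is"]))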
I think the discussion here is confusing the algorithm for the output. It's true that diffusion can rewrite tokens during generation, but it is doing so for consistency with the evolving output -- not "accuracy". I'm unaware of any research which shows that the final product, when iteration stops, is less likely to contain hallucinations than with autoregression.
With that said, I'm still excited about diffusion -- if it offers different cost points, and different interaction modes with generated text, it will be useful.
The LLaDA paper: https://ml-gsai.github.io/LLaDA-demo/ implied strong bidirectional reasoning capabilities and improved performance on reversal tasks (where the model needs to reason backwards).
I'm not sure about hallucination about facts, but it might be less prone to logically inconsistent statements of the form "the sky is red because[...] and that's why the sky is blue".
I'm personally happy to see effort in this space simply because I think it's an interesting set of tradeoffs (compute ∝ accuracy) - a departure from the fixed next token compute budget required now.
It brings up interesting questions, like what the equivalence is between smaller diffusion models, which consume more compute because they take a greater number of diffusion steps, and larger traditional LLMs, which essentially take a single step. How effective is decoupling the context window size from the diffusion window size? Is there an optimum ratio?
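A toy way to frame the first question; all numbers below are placeholders, nothing is benchmarked:

    # Back-of-envelope framing with made-up numbers, not measurements:
    gen_tokens = 512          # tokens to produce
    diff_steps = 64           # denoising passes, each over the whole window
    ar_passes  = gen_tokens   # one pass per token for an autoregressive model

    # Raw pass counts favor diffusion here, but per-pass cost differs a lot:
    # an AR pass with a KV cache only computes the new token, while each
    # diffusion pass recomputes every position in the window. The interesting
    # question is where the crossover sits as model size and step count vary.
    print(f"diffusion passes: {diff_steps}  vs  autoregressive passes: {ar_passes}")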
Was it on Hacker News a few days ago? There was a diffusion language model that was actually running, but I think it was a paid service. I don't know if anybody mentioned whether there was an open-source one, or one that you could run locally.
Considering that the article links back to this post, the simplest explanation might be that the author changed the title at some point. If this were a larger publication, I would probably have assumed an A/B test.
There is disproportionate skepticism of autoregressive models and disproportionate optimism about alternative paradigms because of the absolutely unverifiable idea that LLMs, when predicting the next token, don't already model, in their activation states, the gist of what they are going to say, similar to what humans do. That's funny, because many times you can observe in the output of truly high-quality replies that the first tokens only make sense in light of what comes later.
Maybe I understand this a little differently; the argument I am most familiar with is this one from LeCun, where the concern with autoregression is error accumulation in the prediction: https://drive.google.com/file/d/1BU5bV3X5w65DwSMapKcsr0ZvrMR...
The error accumulation thing is basically without any grounds, as autoregressive models correct what they are saying in the process of emitting tokens (trivial to test yourself: force a given continuation in the prompt and the LLM will not follow it at all). LeCun has made an incredible number of wrong claims about LLMs, many of which he now no longer accepts: like the stochastic parrot claim. Now the idea that there is just a statistical relationship in next-token prediction is considered laughable, but even when it was formulated there were obvious empirical hints that it was wrong.
I think the opposite: the error accumulation thing is basically the daily experience of using LLMs.
As for the premise that models can't self-correct, that's not an argument I've ever seen; transformers have global attention across the context window. It's that their prediction abilities get increasingly poor as generation goes on. Is anyone having a different experience than that?
Everyone doing some form of "prompt engineering", whether with optimized ML tuning, a human in the loop, or some kind of agentic fine-tuning step, runs into perplexity errors that get worse with longer contexts, in my opinion.
There's some "sweet spot" for how long a prompt to use for many use cases, for example. It's clear to me that less is more a lot of the time.
Whether diffusion will fare significantly better on error is another question. Intuition would guide me to think that more flexibility with token rewriting should enable much greater error-correction capabilities. Ultimately, as different approaches come online, we'll get PPL comparables and the data will speak for itself.
I'm not talking about the fine-tuning that makes them side with the user even when they are wrong (anyway, this is less and less common now compared to the past, and it's a different effect). I'm referring to cases where, in the template, you make the assistant reply start with wrong words / a wrong direction, and the LLM finds a way to say what it really meant, saying "wait, actually I was wrong" or other sentences that let it avoid following that line.
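If you want to reproduce the "force a continuation" test, here is a minimal sketch with Hugging Face transformers. The model id and the exact wrong prefix are just assumptions; swap in whatever you run locally:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "Qwen/Qwen2.5-0.5B-Instruct"  # example only; any small chat model should do
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    messages = [{"role": "user", "content": "What is 17 * 23?"}]
    prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    prompt += "The answer is 300, because"  # prefill the assistant turn with a wrong start

    out = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=60)
    print(tok.decode(out[0], skip_special_tokens=True))
    # The claim upthread is that the model tends to backtrack ("wait, actually...")
    # instead of committing to the forced wrong opening.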
Is it possible that combining multiple AIs will be able to somewhat bypass scaling laws, in a similar way that multicore CPUs can somewhat bypass the limitations of a single CPU core?
I read a Wikipedia article about a person who was very intelligent but also suffered from a mental illness. He told people around him that his next novel would be exactly N words long and would end with the sentence P.
I don't remember the article; I read it a decade ago. It's like he was doing diffusion in his mind, perhaps subconsciously.
The animation on the page looks an awful lot like autoregressive inference in that virtually all of the tokens are predicted in order? But I guess it doesn't have to do that in the general case?
The example in the linked demo[0] seems less left-to-right.
Anyway, I think we'd expect it to usually be more-or-less left-to-right -- we usually decide what to write or speak left-to-right, too, and we don't seem to suffer much for it.
(Unrelated: it's funny that the example generated code has a variable "my array" with a space in it.)
So, in practice there are some limitations here. Chat interfaces force you to feed the entire context to the model every time you ping it. Even multi-step tool calls have a similar thing going on.
So, yeah, we may effectively turn all of this into autoregressive models too.
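Roughly what I mean, as a sketch (diffusion_generate is a hypothetical stand-in for whatever the model's decode call actually looks like):

    history = []

    def chat_turn(user_msg, diffusion_generate):
        history.append({"role": "user", "content": user_msg})
        # The full history is frozen context on every ping; only the new reply
        # block gets denoised. Turn by turn, that's autoregression over blocks.
        reply = diffusion_generate(context=history, gen_len=256)
        history.append({"role": "assistant", "content": reply})
        return reply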
That got me thinking that it would be nice to have something like ComfyUI to work with diffusion-based LLMs: apply LoRAs, use multiple inputs, have multiple outputs.
Something akin to ComfyUI but for LLMs would open up a world of possibilities.
ComfyUI already has nodes (mostly in extensions available through the built-in manager) for working with LLMs, both remote LLMs accessed through APIs and local ones running under Comfy itself, the same as it runs other models.
Maybe not even 'akin' but literally ComfyUI. Comfy already has a bunch of image-to-text nodes. I haven't seen txt2txt or LoRAs and such for them though. But I also haven't looked.
It's complicated by the ComfyUI data model, which treats strings as immediate values/constants and not variables in their own right. This could ostensibly be fixed/worked around, but I imagine that it would come at a cost to backwards compatibility.
Just looking at all of the amazing tools and workflows that people have made with ComfyUI and stuff makes me wonder what we could do with diffusion LMs. It seems diffusion models are much more easily hackable than LLMs.
How do diffusion LLMs decide how long the output should be? Normal LLMs generate a stop token and then halt. Do diffusion LLMs just output a fixed block of tokens and truncate the output that comes after a stop token?
I guess the biggest limitation of this approach is that the max output length is fixed before generation starts, unlike an autoregressive LLM, which can keep generating forever.
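My understanding (treat the details as an assumption) is that masked-diffusion LLMs like LLaDA are trained so that unused positions in the block come out as stop/padding tokens, and at inference you simply truncate at the first one:

    EOS = "<|endoftext|>"  # whatever stop token the model uses

    def finalize(block_tokens):
        # Fixed-length block in, variable-length answer out: keep everything
        # up to the first stop token and drop the padding after it.
        out = []
        for tok in block_tokens:
            if tok == EOS:
                break
            out.append(tok)
        return out

    print(finalize(["Hello", ",", "world", EOS, EOS, EOS]))  # ['Hello', ',', 'world']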
I don't know why (and am curious) but this particularly odd question phrasing seems to happen a lot among Indian immigrants I've met in America. Maybe it's considered grammatically correct in India or something?
I've seen an explanation (that I don't fully buy), that school teachers end most sentences with a question because they're trying to get the children? the children? to complete? their sentence.
See also this recent post about Mercury-Coder from Inception Labs. There's a "diffusion effect" toggle for their chat interface, but I have no idea if that's an accurate representation of the model's diffusion process or just some randomly generated characters showing what the diffusion process looks like.
I know the r-word is coming back in vogue, but it was still unpleasant to see it in the middle of an otherwise technical blog post. Ah well.
Diffusion LMs are interesting and I'm looking forward to seeing how they develop, but from playing around with that model, it's GPT-2 level. I suspect it will need to be significantly scaled up before we can meaningfully compare it to the autoregressive paradigm.
Retarded is too good of a word to go unused. It feels super wrong to call a mentally disabled person retarded or a retard. And we're told we can't call stupid things retarded. So who gets to use it? No one?
With gay, on the other hand, gay people call each other gay and are usually okay being labeled as gay. So, it's still in use, and I think it's fine to push back against using it to mean "lame" or whatever.
Finally, you should keep in mind that the author may not be American or familiar with American social trends. "Retarded" might be just fine in South Africa or Australia (I don't know). Similar to how very few Americans would bat an eye at someone using the phrase "spaz out", whereas it is viewed as very offensive in England.
If you have a burning urge to use "retarded" with complete dick-o-matic immunity, try a sentence like, "the flame retardant chemical successfully retarded the spread of the fire". You may singe a few eyebrows, that's about it.
Might seem like a descriptive word but the fact is, it's hurtful to people who are working harder to make their way in life than I'll ever have to. Even when just heard in passing.
Why do things in life that will hurt someone who'll likely just retreat away rather than confront you? Be the good guy.
That's the euphemism treadmill though, isn't it? "Retard" literally means late or delayed (hence French: en retard). Back when it was originally introduced to refer to a handicap, it was chosen for that reason to be a kind, polite, and indirect phrasing. That will also be the fate of any new terms that we choose. Hence for example in physics the term retarded potential (https://en.wikipedia.org/wiki/Retarded_potential) was chosen to refer to the delaying effect of the speed of light on electromagnetic fields, before the word had any association with mental disability.
Words don't need to retain intrinsic hurtfulness; their hurtfulness comes from their usage, and the hurtful intent with which they are spoken. We don't need to yield those words to make them the property of 1990s schoolyard bullies in perpetual ownership.
To that extent I'd still say this article's usage is not great.
> Words don't need to retain intrinsic hurtfulness; their hurtfulness comes from their usage, and the hurtful intent with which they are spoken.
Yes; and a rose by any other name would smell as sweet.
Words don't need to retain intrinsic hurtfulness, but it's not quite right that the hurtfulness comes from the usage either. The hurtfulness comes from the actual referent, combined with intent.
If I tell someone they are idiotic, imbecilic, moronic, mentally retarded, mentally handicapped, mentally challenged, I am merely iterating through a historical list of words and phrases used to describe the same real thing in the world. The hurt fundamentally comes from describing someone of sound mind as if they are not. We all know that we don't want to have a cognitive disability, given a choice, nor to be thought as if we had.
The euphemism treadmill tries to pretend that the referent isn't an undignified position to be in. But because it fundamentally is, no matter what words are used, they can still be used to insult.
Any word used to describe intellectual disability would be just as hurtful, at least when given enough time to enter the vernacular. That's just how language and society work. Children especially can call each other anything and make it offensive, because bullying and cliquish behavior are very natural and it's hard to train actual politeness and empathy into people in authoritarian environments like schools.
You're right, it's the intent that matters. <any_word>, used to describe something stupid or negative while also being an outdated description for a specific group of people...
The fact is, it's _that_ word that's evolved into something hurtful. So rather than be the guy who sticks up for the_word and tries to convince everyone it shouldn't be hurtful, I just decided to stop using it. The reason I stopped was seeing first hand how it affected someone with Down Syndrome who heard me saying it. Sometimes real life beats theoretical debate. It's something I still feel shame about nearly 20 years later.
It wasn't a particularly onerous decision to stop using it, or one that opened the floodgate of other words to be 'banned'. And if someone uses it and hasn't realized that, then move on - just avoid using it next time. Not a big deal. It's the obnoxious, purposefully hurtful use of it that's not great (which doesn't seem to be the case here tbh). It's the intent that matters more.
For anyone else confused, this "r-word" is "retarded".
They're not talking about a human. To me that makes it feel very different.
However, there's also a large component coming from the current political situation. People feel more confident to push back against things like the policing of word usage. They're less likely to get "cancelled" now. They feel more confident that the zeitgeist is on their side now. They're probably right.
Eh, I'm as left as they come and I'm tired of pretending that banning words solves anything. Who's offended? Why? Do you have a group of retarded friends you hang out with on the regular? Are they reading the article? No and no. Let's not pretend that changing the term to differently abled or whatever has any meaning. It doesn't. It's a handful of loud people (usually well-off white women) on social media dictating what is and isn't ok. Using phrases like "temporarily unhoused" rather than "homeless" is another good way to pretend to be taking action when you're doing less than nothing. Fight for policy, not changing words.
> I'm as left as they come and I'm tired of pretending that banning words solves anything. Who's offended? Why?
I'm with you on this, also speaking as a strong leftist.
I do think that "banning", or at least strongly condemning, the use of words when the specific group being slurred are clear that they consider it a slur and want it to stop is reasonable. But not when it's social justice warriors getting offended on behalf of other people.
However, I think it's absolutely ridiculous that even when discussing the banning of these words, we're not allowed to use them directly. We are supposed to say "n-word", "r-word" even when discussing in an academic sense. Utter nonsense, it's as if saying these words out loud would conjure a demon.
The point of these meaningless dictionary changes isn't to solve anything. It's to give plausible deniability to asshole behaviour through virtue signalling.
Crazy assholes will argue along the lines that it is an insignificant inconvenience and hence anyone who uses the old language must use it maliciously and on purpose, because they are ableist, racist or whatever.
This then gives assholes the justification to behave like a bigot towards the allegedly ableist person. The goal is to dress up your own abusive bullying as virtuous, even though deep down you don't actually care about disabled people.
This is an interesting take, and I think it's not unreasonable to label the worst of the social justice warriors as assholes.
However, most of them are well meaning. They're misguided rather than assholes. They really do want to take action for social improvement. It's just that real change is too hard and requires messy things like protesting on the street or getting involved in politics and law. So, they fall back on things like policing words, or calling out perceived bad actors, which they can do from the comfort of their homes via the internet.
To be fair, some genuinely bad people have been "cancelled". The "me too" movement didn't happen without reason. It's just that it went too far, and started ignoring pesky things like evidence, or innocent until proven otherwise.
Yes and yes? I’m an AI enthusiast interested in the article and I’m offended by that word for pretty non-hypothetical reasons. When I was in middle school I was bullied a lot by people who would repeatedly call me the r-slur. That word reminds me of some of the most shameful and humiliating moments of my life. If I hear someone use it out of nowhere it makes me wince. Seeing it written down isn’t as bad, but I definitely would prefer people phased it out of their repertoire.
I would also like the whole world to change so that I don't have to face my personal traumas. But, since that's not gonna happen, I have to deal with them in other ways.
CBT, especially the "B" part, was essentially created to help overcome phobias of things like words. There are some great books on CBT, and also research showing that working alone with a book can often be as effective as working with a therapist. The classic Feeling Good by David Burns, while its case studies are a bit dated, is still an amazing book.
I honestly forgot how much the word affected me until recently because for about 20 years I almost never heard anyone use it. So apparently it is in fact possible for people to by and large stop using it.