Reubend's comments | Hacker News

After trying to understand their method, I think you're right. Doesn't seem like anything that I would personally call "diffusion". Much closer to MTP + speculative decoding.

Then again, their results with it are great. It would be interesting to benchmark it against standard SD on a model that already uses MTP.


Yeah, I think it's a super neat way to do MTP. Conceptually much more pleasing and simple than existing methods, especially since scaling `k` as models get better will be easier this way. Wish it had been presented as such.
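For anyone who wants the flavor of it, here's a toy sketch of MTP-plus-speculative-decoding in Python. To be clear: the function names and the greedy verification rule below are my own illustration of the general technique, not the paper's actual method.

```python
# Toy model of MTP-style self-speculative decoding (illustrative only).

def target_next(prefix):
    # Stand-in for the full model's greedy next-token choice.
    return (sum(prefix) * 31 + 7) % 100

def mtp_draft(prefix, k):
    # Stand-in for the model's MTP heads: guess k future tokens in one
    # cheap pass. The last guess is deliberately wrong here, to simulate
    # the heads getting less accurate further out.
    draft, cur = [], list(prefix)
    for i in range(k):
        tok = target_next(cur)
        if i == k - 1:
            tok = (tok + 1) % 100
        draft.append(tok)
        cur.append(tok)
    return draft

def speculative_step(prefix, k=4):
    draft = mtp_draft(prefix, k)
    accepted = []
    # Verification: one (conceptually batched) pass of the full model over
    # the draft; accept tokens until the first mismatch.
    for tok in draft:
        if target_next(prefix + accepted) != tok:
            break
        accepted.append(tok)
    # Always emit one token from the full model, so every step progresses.
    accepted.append(target_next(prefix + accepted))
    return accepted

seq = [1, 2, 3]
for _ in range(3):
    step = speculative_step(seq)
    print(f"accepted {len(step)} tokens: {step}")
    seq += step
```

The point being: the draft and the verifier are the same model, which is what makes scaling `k` cheap.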

This page is buggy for me and doesn't show any plans. But when I go to their main pricing page, it's got some contradictory info about which plans include "Kimi Code". The $19 per month plan says that it comes with "Kimi Code available" but then shows an "X" near Kimi Code while the others have labels like "1x credits". So I guess they meant to say that it doesn't have it?

I got confused by that too.

The top table starts with the first paid plan, but the second table starts with the free plan. So the X for Kimi Code is on the free plan, not the Moderato plan at the top that says "Kimi Code available".


Because the website doesn't seem to show any sample size of runs, I assume they ran it once across the suite.

The models are nondeterministic, and therefore it's pretty normal for different runs to give different results.

I don't see this as evidence that Opus 4.6 has gotten worse.


> The models are nondeterministic, and therefore it's pretty normal for different runs to give different results.

And how is that an excuse?

I don't care about how good a model could be. I care about how good a model was on my run.

Consequently, my opinion on a model is going to be based around its worst performance, not its best.

As such, this qualifies as strong evidence that Opus 4.6 has gotten worse.


>> The models are nondeterministic, and therefore it's pretty normal for different runs to give different results.

> And how is that an excuse? […] this qualifies as strong evidence…

This qualifies as nothing, because of how random processes work; that's what the GP is saying. The numbers are not reliable if it's just one run.

If this is counter-intuitive, a refresher on basic statistics and probability theory may be in order.
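A quick simulation makes it concrete. The numbers here are made up: assume a model that solves each of 50 tasks independently with probability 0.7.

```python
# Single-run benchmark variance, simulated. Hypothetical numbers: a model
# that solves each of 50 tasks independently with probability 0.7.
import random

random.seed(42)
TASKS, P = 50, 0.70

def one_run():
    return sum(random.random() < P for _ in range(TASKS)) / TASKS

print([f"{one_run():.0%}" for _ in range(10)])
# Scores typically land anywhere from ~60% to ~80% even though nothing
# about the "model" changed between runs. Two single runs can differ by
# 10+ points on pure noise, which is why one run proves nothing either way.
```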


> If this is counter-intuitive, a refresher on basic statistics and probability theory may be in order.

I'm not running "statistics". I'm running an individual run. I care about the individual quality of my run and not the general quality of the "aggregate".

The problem here is that the difference may not be immediately observable. Sure, if it doesn't give a correct answer, that's quickly catchable. If it costs me 10x the time, that's not immediately catchable but no less problematic.


No, what they're saying is the previous run could have just been lucky and not representative!

I would love to know what you’re doing in the harness to not feel the total degradation in experience now in comparison to December & January.

>I don't see this as evidence that Opus 4.6 has gotten worse.

I see it as corroborating evidence of actual everyday experience.

Also, any reason to imply "BridgeBench", apparently dedicated to AI benchmarking, wouldn't have run it more than once across the suite?


> Also, any reason to imply "BridgeBench", apparently dedicated to AI benchmarking, wouldn't have run it more than once across the suite?

They didn't list a sample size of runs, didn't show any numbers for variance across runs, etc...

So while they may have done that behind the scenes and just not told us, this doesn't seem like a rigorous analysis to me. It seems to me like people just want to find data that support the conclusion they already decided on (which is that Opus got worse).


Are models really non-deterministic?

People are describing the results when they say models are non-deterministic. Give it the same exact input twice, and you'll get two different outputs. Deterministic would mean the same input always gives the same output.

Yes. Look up LLM "temperature": it's an internal parameter that tweaks how deterministically they behave.
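Roughly, temperature rescales the logits before sampling, and at temperature 0 sampling collapses to argmax. A minimal sketch of the idea, not any particular implementation:

```python
import math, random

def sample(logits, temperature=1.0):
    if temperature == 0:
        # Greedy decoding: always pick the highest logit.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)                            # subtract max for stability
    exps = [math.exp(l - m) for l in scaled]
    r = random.random() * sum(exps)
    for i, e in enumerate(exps):               # sample from the softmax
        r -= e
        if r <= 0:
            return i
    return len(exps) - 1

logits = [2.0, 1.5, 0.3]
print([sample(logits, 1.0) for _ in range(10)])  # varies between runs
print([sample(logits, 0.0) for _ in range(10)])  # always token 0
```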

The models are deterministic, the inference is not.

Which is a useless distinction. When we say models in this context we mean the whole LLM + infrastructure to serve it (including caches, etc).

What does that even mean?

Even then, depending on the specific implementation, floating-point associativity could be an issue: it varies with batch size, with exactly how the KV cache is implemented, etc.


That's still an inference-time issue. If you have perfect inference with zero temperature, the models are deterministic. There is no intrinsic randomness in software-only computing.

Floating-point associativity differences can lead to non-determinism at temperature 0 if the order of operations is non-deterministic.

Anyone with reasonable experience with GPU computation who pays attention knows that even randomness in warp completion times can easily lead to non-determinism due to associativity differences.

For instance: https://www.twosigma.com/articles/a-workaround-for-non-deter...

Among practitioners, it is very well known that CUDA isn't strongly deterministic due to these factors.

Differences in batch sizes of inference compound these issues.

Edit: to be more specific, the non-determinism mostly comes from map-reduce style operations, where the map is deterministic, but the order that items are sent to the reduce steps (or how elements are arranged in the tree for a tree reduce) can be non-deterministic.
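You don't even need a GPU to see the associativity effect; summing the same floats in different orders already diverges. Toy example:

```python
# Floating-point addition is not associative, so reduction order matters.
import random

random.seed(0)
xs = [random.uniform(-1, 1) * 10 ** random.randint(-8, 8)
      for _ in range(10_000)]

a = sum(xs)                   # one reduction order
b = sum(reversed(xs))         # another order
c = sum(sorted(xs, key=abs))  # yet another

print(a, b, c)                # three slightly different values
print(a == b == c)            # almost certainly False
# On a GPU the reduction order depends on scheduling (warp timing, batch
# size, kernel choice), so the same divergence shows up between runs.
```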


My point is, your inference process is the non-deterministic part; not the model itself.

Eh, if you have a PyTorch model that uses non-deterministic tensor operations like matrix multiplications, I think it's fair to call the model non-deterministic, since the matmul is not guaranteed to be deterministic; the non-determinism of a matmul isn't a bug but a feature.

See e.g. https://discuss.pytorch.org/t/why-is-torch-mm-non-determinis...
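For what it's worth, PyTorch can be told to prefer deterministic kernels, at a performance cost; ops with no deterministic implementation will raise instead of silently varying:

```python
import torch

torch.manual_seed(0)
# Prefer deterministic kernels where they exist; error out otherwise.
# (On CUDA, cuBLAS also needs CUBLAS_WORKSPACE_CONFIG=":4096:8" in the env.)
torch.use_deterministic_algorithms(True)

a = torch.randn(256, 256)
b = torch.randn(256, 256)
print(torch.equal(a @ b, a @ b))  # True within one process/device/build;
                                  # equality across devices or library
                                  # versions is still not guaranteed
```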


Cool experiment! But the "CEO" agent picked the most boring possible items to sell: t-shirts and some bland art prints designed by AI. I would have loved to see more creativity given that they could have picked anything.

Not surprised, actually. TBH this is the biggest gap in "AI can make you a website": the aesthetics are always so boring and bland, or often just fugly (bad colour matching, inappropriate padding and margins, etc.). And the logos it generates are similarly boring, as can be seen from the smiley-face logo here. What does this store sell? A sparse layout like this, in a high-rent location, typically sells very expensive, very niche products that you can't get anywhere else. This seems to me like it has already failed.

Agreed. I assume the products were decided upon based on market research of the area. Maybe, though, the model will be able to iterate and adapt faster than a human CEO would? I guess we will just have to wait and see.

It looks like every "lifestyle" company/brand I've been seeing come out of Millennials/Gen Z. Next up, it will offer "coaching" on IG or some similar play where it promises to fix your life without having fixed its own.

I expect earlier iterations successfully circumvented local regulations and created high-street bookies.

AI is the best thing that happened to America in the last decade, and I dearly hope that politicians don't try to ruin it the way they're ruining other parts of the country.

I respect some of Bernie's positions, but his stance on tech is dangerous.


AI is the best thing that happened for a relatively small number of privileged people, most of them located in the USA. This includes the programmers who may have benefited from using AI assistants.

AI has already stolen great amounts of money from a very large number of people all around the world, due to the huge increases in the prices of DRAM, SSDs and HDDs.

Moreover, there have already been a great number of layoffs which, truly or not, have been blamed on AI. AI may not have been the true reason for those, but it certainly has provided a convenient justification.

There is no doubt that the number of people who have already been harmed by AI greatly exceeds the number of people who have benefited from AI.

There is no reason to believe that this trend will not continue.

When used in the right way, there is no doubt that LLMs and other ML/AI tools can enable significant progress, but recent history makes it seem almost certain that they will be used in the wrong way more often than in the right way, so most people will be negatively affected, not positively.

The problem is not AI itself, but the fact that AI is a tool controlled by extremely evil people, e.g. Sam Altman and Larry Ellison. It is very unlikely that this will change and this is the reason why AI will do more harm than good.

(There are a lot of examples proving that individuals like those I have named are truly evil, but I will just quote from TFA: "Larry Ellison predicts an AI-powered surveillance state in which “citizens will be on their best behavior, because we’re constantly recording and reporting everything that is going on”". This alone is enough to prove that Ellison is an enemy.)


> AI has already stolen great amounts of money from a very large number of people all around the world, due to the huge increases in the prices of DRAM, SSDs and HDDs.

None of that is stealing. That's the free market.


Well, it's not like AI investment money is coming out of thin air: ultimately, normies are buying stuff that AI-enhanced companies produce, which allows those companies to feed this money to the actual winners, the 1% that truly benefits. I don't think it's fair to say that AI has been only negative for the normies, when they need to be the ones willingly feeding the beast with their money for the whole thing to perpetuate.

That said I obviously also see a bunch of negative consequences, and perhaps agree that the negatives outweigh the positives.


To America as a whole? How?

AI is certainly powerful, but despite tech CEO whitewashing, none of them are planning for how the economy will recover from a potential devastation of white collar jobs. Token bills fund rich investors & executives, not everyday Americans.

For AI to give me abundant free time & happiness, I need to have money, and I don’t see UBI anywhere on OpenAI’s roadmap.


> I don’t see UBI anywhere on OpenAI’s roadmap.

Do you really think it's OpenAI's job to create UBI? Surely, if you feel that it's a good idea, then it should be the government who sets it up.

We can't just magically freeze the economy in time. If we conduct our industries inefficiently just to keep jobs around, we won't be competitive on a global market.

I have strong concerns around the inflationary effects of UBI, but whatever the solution there is, it's not the responsibility of private companies to organize their own welfare systems.


For those of us who watched FB play the “move fast and break things” card, and are now watching the predicted effects of that play out, we think people like YOU are dangerous, and we respect people like Bernie for trying to pump the brakes (knowing the last 10 years have been downhill).

"Legibility" must be the wrong word because I can't understand what the author is talking about. Is he saying that the overuse of abstractions is ruining corporate culture? Or is he saying that the uniformity of corporate processes is becoming overbearing?

I think this needs to be rewritten with different terminology.

"...you don't care about code review"

Code review is one of the things I care about most! In fact, now that we're in an age where many code changes are generated by LLMs, I think code reviews are far more important than they used to be.

"Processes adopted by a company aren't about the end result of the work. Their stated goals and their actual goals are always distinct."

Wrong again. Often the goal is exactly as stated, for better or for worse. Let's take incident reviews as an example: their goal is to reduce the occurrence of emergency incidents in the future by learning from the mistakes that led to incidents in the past. There's no doublethink involved.


"Legibility" here just means something that is visibly important/useful/scrutable/not a waste of time to (i.e. "is legible to") a manager.

I would suggest that people stop overfocusing on benchmarks and give this a try. Gemma 4 is performing really well for me, and seems to hallucinate much less than other models I've tried in this size range.


This is pretty impressive. I think this sort of thing is a perfect fit for agentic coding, because you can compare the generated assembly afterwards as a safeguard/test. Plus, even if the code is messy, you can always ask the LLMs to do some cleanup passes afterwards.
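That safeguard could be as simple as diffing compiler output. A rough sketch of the idea, assuming C sources and gcc on the PATH; the filenames here are hypothetical:

```python
# Compile the original and the LLM-rewritten source with identical flags,
# then compare the emitted assembly. Filenames are hypothetical.
import subprocess

def assembly(src: str) -> str:
    out = subprocess.run(
        ["gcc", "-O2", "-S", "-o", "-", src],  # -S: emit assembly to stdout
        capture_output=True, text=True, check=True,
    )
    # (In practice you'd strip .file/.ident directives before diffing,
    # since they embed the source filename.)
    return out.stdout

before = assembly("original.c")
after = assembly("rewritten.c")
print("identical codegen" if before == after else "codegen differs, review!")
```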


You're absolutely right. And these Intel GPUs will also be much faster in terms of actual math than the M series GPUs that the Apple setup would have.


This looks great, but I'm wondering how effective it would be for full model weights rather than just the KV cache. Their paper only gives results for the KV-cache use case, which strikes me as strange since the algorithms are claimed to be near-optimal.

