Hacker News | MagicMoonlight's comments

This article says Anthropic models can write out the entire benchmark solution set word for word from memory.

Actually no, it will increase it. Because it’ll be trained with the deletion command as a valid output.

Exactly. It’s just giving the LLM a token pattern, and it’s designed to reproduce token patterns. That’s all it does. At some point, generating a token pattern like that again is literally its job.

Why would one set up reinforcement learning like that?

The point of creating samples from user data should surely be to label them good or bad, based on the whole conversation.

You look at what happened eventually, judge the outcome as bad, and thus train the "rm" token in the middle to be less likely.
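The outcome-based labelling described above can be sketched as a toy REINFORCE-style update. This is a hypothetical illustration, not any real training pipeline: a tiny three-token "policy", where a negative reward for a conversation that ended badly pushes down the probability of every token the model emitted, including the "rm" in the middle.

```python
# Toy sketch (hypothetical): judge the whole conversation's outcome,
# then nudge the probability of each emitted token accordingly.
import math

# Hypothetical logits for a tiny three-token "policy".
logits = {"ls": 1.0, "rm": 1.0, "cat": 1.0}

def probs(lg):
    z = sum(math.exp(v) for v in lg.values())
    return {t: math.exp(v) / z for t, v in lg.items()}

def reinforce(trajectory, reward, lr=0.5):
    """REINFORCE-style update: reward < 0 makes every token in the
    trajectory less likely; reward > 0 makes it more likely."""
    for tok in trajectory:
        p = probs(logits)
        for t in logits:
            # d(log p[tok]) / d(logit t) = 1 - p[t] if t == tok, else -p[t]
            grad = (1.0 if t == tok else 0.0) - p[t]
            logits[t] += lr * reward * grad

before = probs(logits)["rm"]
reinforce(["rm"], reward=-1.0)  # the outcome was judged bad
after = probs(logits)["rm"]
# after < before: "rm" is now less likely
```

The point of the sketch is only the sign of the update: the label comes from the eventual outcome, not from the token itself.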


It is possible, but it requires specifically labelling the data. You have to craft question response pairs to label. But even then the result is only probabilistic.

The LLM in this case had been very thoroughly trained and instructed quite specifically not to do many of the things it then went off and did.

It may be that there's a kind of cascade effect going on here. Possibly once the LLM breaks one rule it's supposed to follow, this sets it off on a pattern of rule violations. After all, what constitutes a rule violation is there in the training set; it is a type of token stream the LLM has been trained on. It could be that the LLM switches into a kind of black-hat mode once it's violated a protocol, which leads it down a path of persistently violating protocols, and given the statistical model, some violations of protocol are always possible.

My mother was a primary school teacher. She used to say that the worst thing you can say to a bunch of kids leaving class down the hall is "don't run in the hall". It puts it in their minds. You need to say "Please walk in the hall", and then they'll do it.


Live by the slop, die by the slop. This is natural selection at work.

It makes sense. Grok is taught to answer the question, regardless of how explicit or extreme it is. These other models are taught to suppress any wrongthink. That's going to make it hard to answer things correctly. If you've been told to answer something incorrectly because it's wrong, then you'll have to make up an answer.

That’s not normal. It’s like saying “I drink 6-10 beers a day so 3-5 is very moderate”

Two hundred pages of shilling and it’s a 1% improvement in the benchmarks. They’re dead in the water.

Imagine spending $100M on some of these AI “geniuses” and this is the best they can do.


Well yes, because they needed high availability and flexibility and tons of features…

Hey wait a minute!


Because if I have a government service with millions of users, I don’t want the cheap shitter servers to crap out on me.

An employee is going to cost anywhere between 8k and 50k per month. Hiring an employee to save 200/month on servers by using a shitty VPS provider is not saving you any money.


If you have millions of users, you absolutely need to have someone whose whole job is managing infrastructure. Expecting servers or cloud services to not crap out on you without someone with the skills and time to keep things running seems foolish.

No, because you’ll end up really dysfunctional if you’re a child in a class with 18 year olds.

They’re also some of the most strategically valuable companies if you are, say, an evil country that wants to build long-range nuclear missiles or advance your space programme.
