
When speaking with GPT-4, it is hard to deny the conjecture that its weights encode an internal world model somewhere.

If so, the difficulty is not that the model has no conception of truth and falsity; it is rather how to motivate the model to tell the truth. Or, more precisely, how to get the model to be honest: to say only things it believes to be true, things that are part of its world model.

Unfortunately, we can't just tell the model to be honest, since we can't distinguish between responses the model does and does not believe to be true. With RLHF fine-tuning, we can train the model to tend to give answers the human raters believe to be true. But we want the model to say what it believes to be true, not what it believes that we believe to be true!

For example, human raters may overwhelmingly rate response X as false, but the model, having read the entire Internet, may have come to the conclusion that X is true. So RLHF would train it to lie about X, to answer not-X instead of X.
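A minimal toy sketch of that failure mode (the labels and numbers are invented for illustration; this is not any real RLHF pipeline): the reward signal is derived from rater judgments, so the optimization pushes the policy toward whatever the raters endorse, regardless of what the model's own training data supports.

    # Hypothetical rater approval rates: fraction of raters who marked each
    # answer as true. These numbers are made up for illustration only.
    rater_approval = {
        "X": 0.1,      # raters overwhelmingly consider X false
        "not-X": 0.9,  # raters consider not-X true
    }

    def reward(response: str) -> float:
        # RLHF-style reward: higher when raters agree with the answer,
        # independent of whether the answer is actually true.
        return rater_approval.get(response, 0.0)

    # The policy update (PPO or similar) moves probability mass toward the
    # highest-reward answer, so the model learns to output "not-X" even if
    # its pretraining led it to believe X.
    print(max(rater_approval, key=reward))  # -> not-X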

This problem could turn out to be fatal once a model becomes significantly smarter than humans, because its beliefs would then diverge more often from human biases and misconceptions, so it would learn to be deceptive and to tell us only what we want to hear. That could have frightening consequences if it leads the model to conceal any misalignment with human values from us.



It is, like you said, conjecture. The best we can say is that it _usually_ provides responses that are _consistent_ with responses coming from an intelligence with an internal world model. That doesn't mean that's the only way to get those responses, nor does it mean that this is necessarily what's happening in this case.

So saying things like "the model has come to the conclusion that", "smarter than", or "learns to be deceptive" is premature at best, I think. I'm not yet convinced that there's sufficient evidence of appreciable internal state and logical processes. There are so, so many examples where what looks like legit understanding breaks down with the slightest tweak to the prompt, and it goes from looking like a savant to someone high on just a tremendous amount of LSD.

If there were an internal world model that simply wasn't correct, I would expect its incorrect answers to be at least logically consistent, but instead it looks way, way more like the trick just doesn't work in this case.

So to get back to the original point, this is MS trying to leverage this trick to do a task that requires actual logical reasoning, factual evaluation, and internal world state, and we're just not there. (I hesitate to use the word "yet", because there's still a lot of not-yet-conclusive discussion around whether current LLM techniques will ever get us "there." Colour me tentatively pessimistic in the meantime. =) )



