
RL isn't in general going to bias the model to use a lot of words, unless your RL training had the goal of favoring long over short responses.

There are now multiple levels of RL being used to post-train these models, from RLHF (use RL to bias the model to generate outputs matching human feedback preferences) to RL used to improve reasoning by generating reasoning steps that lead to verified correct conclusions (in areas like math and programming where correctness can be verified).

RLHF (not RL in general) may lead to longer, more verbose outputs to the extent that human raters indicated longer responses as their preference. Maybe raters are easily bullshitted and prefer something longer that sounds like a more comprehensive, authoritative answer?
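That kind of bias is easy to check for in a preference dataset: count how often the rater-chosen response is simply the longer one. The numbers below are made up for illustration.

```python
# Each pair is (len_chosen, len_rejected) in tokens, from hypothetical
# human preference pairs. If raters systematically prefer the longer
# response, RLHF against a reward model fit to this data inherits the bias.
pairs = [
    (310, 120), (280, 150), (90, 200), (400, 180), (350, 140),
]

longer_preferred = sum(1 for chosen, rejected in pairs if chosen > rejected)
rate = longer_preferred / len(pairs)
print(f"Longer response preferred in {rate:.0%} of pairs")  # 80% here
```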

There is also the fact that an LLM, unless prompted otherwise, is trying to predict Mr. Average (of the entire training set), who is more likely to waffle on than an expert who will cut to the chase and just give the facts, which they have a firm grasp of. You can of course prompt the model to behave like an expert, or take on any given role, or to be more concise, which may or may not result in better output. It's a bit like asking the model to summarize when it's not really summarizing but instead predicting what a summary would look like (form vs function).
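Steering the model away from Mr. Average is just a prompt change, no retraining. A minimal sketch in the common OpenAI-style chat format (no API call is made here; the role name and instruction wording are arbitrary examples):

```python
# A system message nudging the model toward the terse expert register
# described above. Whether the answer is actually expert-quality is a
# separate question: the model is still predicting what an expert's
# answer would look like, not consulting expertise.
messages = [
    {"role": "system",
     "content": "You are a senior network engineer. Answer concisely, "
                "facts only, no preamble."},
    {"role": "user", "content": "Why does TCP use a three-way handshake?"},
]

print(messages[0]["content"])
```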


