Figure 13 right panel also shows there isn't a y=x relationship on out-of-sample tests.
First, we agree by observation that outside of the top-right and bottom-left corners there isn't any meaningful relationship in the data, regardless of the numerical value of the correlation. Second, in those corners it is not clear to me what the relationship is but it looks flattish (i.e. if the ground truth is ~0 then the model-guess-for-truth could be anywhere from 0 to 0.5). This is also consistent with the general behavior displayed in Figure 13.
If you have some other interpretation of the data you should lay it out. The authors certainly did not do that.
edit:
By the way there are people working on a re-sampling algorithm based on the entropy and variance of the output logits called entropix: if the output probabilities for the next token are spread fairly evenly, for example, rather than putting overwhelming probability on a single token, they prompt for additional clarification. They don't really claim anything like the model "knows" whether it's wrong, but they say it improves performance.
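Roughly, the gating idea looks something like this (my own minimal sketch, not entropix's actual code; the thresholds and the three-way split are illustrative):

```python
import torch
import torch.nn.functional as F

def entropy_and_varentropy(logits: torch.Tensor):
    """Entropy and variance of surprisal for a next-token distribution (1D logits over the vocab)."""
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(-1)                     # H = -sum p log p
    varentropy = (probs * (log_probs + entropy) ** 2).sum(-1)  # Var(-log p)
    return entropy, varentropy

def choose_strategy(logits, ent_thresh=3.0, var_thresh=5.0):
    """Gate what to do next on how 'unsure' the distribution looks.
    Thresholds are illustrative, not tuned to anything."""
    ent, var = entropy_and_varentropy(logits)
    if ent < ent_thresh and var < var_thresh:
        return "argmax"             # confident and stable: just take the top token
    if ent > ent_thresh and var > var_thresh:
        return "ask_clarification"  # flat and unstable: inject a clarifying prompt
    return "resample"               # in between: sample again, e.g. at higher temperature
```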
>Figure 13 right panel also shows there isn't a y=x relationship on out-of-sample tests.
A y=x relationship is not necessary for meaningful correlation and the abstract is quite clear on out of sample performance either way.
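Toy illustration of that point (made-up numbers): a predictor can be nowhere near y=x and still correlate strongly with the target, and still carry real information about it.

```python
import numpy as np

rng = np.random.default_rng(0)
truth = rng.uniform(0, 1, 1000)                        # hypothetical per-question accuracy
pred = 0.3 * truth + 0.2 + rng.normal(0, 0.02, 1000)   # compressed and offset "predictions"

r = np.corrcoef(truth, pred)[0, 1]
print(round(r, 3))  # ~0.97: strong correlation, far from y=x
```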
>Second, in those corners it is not clear to me what the relationship is but it looks flattish (i.e. if the ground truth is ~0 then the model-guess-for-truth could be anywhere from 0 to 0.5).
The upper bound for guess-for-truth is not as important as the frequency. Yes it could guess 0.5 for 0 but how often compared to reasonable numbers? A test set on TriviaQA could well be thousands of questions.
>edit: By the way there are people working on a re-sampling algorithm based on the entropy and variance of the output logits called entropix
I know about entropix. It hinges strongly on the model's representations. If it works, then choosing to call it "knowing" or not is just semantics.
> A y=x relationship is not necessary for meaningful correlation
I’m not concerned with correlation (which may or may not indicate an actual relationship) per se; I’m concerned with whether there is a meaningful relationship between predicted and actual. The Figure 12 plot clearly shows that predicted isn’t tracking actual even in the corners. I think one of the lines in Figure 13 right (predicting 0% while the actual is something like 40%, going from memory on my phone) shows even more clearly that there isn’t a meaningful relationship. In any case the authors haven’t made any argument about how those plots support their claims, and I don’t think you can either.
> the abstract is quite clear on out of sample performance either way.
Yes I’m saying the abstract is not supported by the results. You might as well say the title is very clear.
> The upper bound for guess-for-truth is not as important as the frequency. Yes it could guess 0.5 for 0 but how often compared to reasonable numbers? A test set on TriviaQA could well be thousands of questions.
Now we’ve gone from “the paper shows” to speculating about what the paper might have shown (and even that is probably not possible, based on the Figure 13 line I described above).
> choosing to call it "knowing" or not is just semantics.
Yes, it’s semantics, but that implies it’s meaningless to use the term instead of talking about the actual underlying properties.
For the red Lambada line in Fig 13 when the model predicts ~0 the ground truth is 0.7. No one can look at that line and say there is a meaningful relationship. The Py Func Synthesis line also doesn't look good above 0.3-0.4.
> The abstract also quite literally states that models struggle with out of distribution tests so again, what is the contradiction here?
Out of distribution is the only test that matters. If it doesn't work out of distribution it doesn't work. Surely you know that.
> Would it have been hard to simply say you found the results unconvincing?
Anyone can look at the graphs, especially Figure 13, and see this isn't a matter of opinion.
> There is nothing contradictory in the paper.
The results contradict the titular claim that "Language Models (Mostly) Know What They Know".
>For the red Lambada line in Fig 13 when the model predicts ~0 the ground truth is 0.7. No one can look at that line and say there is a meaningful relationship. The Py Func Synthesis line also doesn't look good above 0.3-0.4.
Yeah but Lambada is not the only line there.
>Out of distribution is the only test that matters. If it doesn't work out of distribution it doesn't work. Surely you know that.
Train the classifier on math questions and you get good calibration for math; train it on true/false questions and you get good calibration for true/false; train it on math and it struggles with true/false (and vice versa). This is what "out-of-distribution" refers to here.
Make no mistake, the fact that the first two both work is evidence that models encode some knowledge about the truthfulness of their responses. If they didn't, it wouldn't work at all. Statistics is not magic and gradient descent won't bring order where there is none.
What the out-of-distribution "failure" here indicates is that "truth" is multifaceted and situation-dependent, and that interpreting the model's features is very difficult. You can't train a "general LLM lie detector", but that doesn't mean model features are unable to provide insight into whether a response is true or not.
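To spell out mechanically what "train on math, test on true/false" means here, a minimal sketch with synthetic stand-in features (my own illustration, not the paper's probe code; load_features is a fake placeholder for extracting hidden-state activations and correctness labels):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss

def load_features(task, n=4000, d=32, seed=0):
    """Fake placeholder for (hidden-state features, was-the-answer-correct labels).
    Each task encodes 'truthfulness' along a different direction, so a probe
    trained on one task only partially transfers to the other."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    w = np.zeros(d)
    if task == "math":
        w[0] = 1.0
    else:  # "true_false"
        w[0], w[1] = 0.3, 1.0
    y = (X @ w + 0.3 * rng.normal(size=n) > 0).astype(int)
    return X, y

X_math, y_math = load_features("math", seed=0)
X_tf, y_tf = load_features("true_false", seed=1)

probe = LogisticRegression(max_iter=1000).fit(X_math, y_math)

# In-distribution: predicted probabilities track correctness reasonably well (low Brier score).
print("in-dist Brier:", brier_score_loss(y_math, probe.predict_proba(X_math)[:, 1]))

# Out-of-distribution: same probe, different task family -- calibration degrades (higher Brier).
print("OOD Brier:", brier_score_loss(y_tf, probe.predict_proba(X_tf)[:, 1]))
```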
> Well good thing Lambada is not the only line there.
There are 3 out-of-distribution lines, all of them bad. I explicitly described two of them. Moreover, it seems like the worst time for your uncertainty indicator to silently fail is when you are out of distribution.
But okay, forget about out-of-distribution and go back to Figure 12, which is in-distribution. What relationship are you supposed to take away from the left panel? From what I understand, they were trying to train a y=x relationship, but as I said previously, the plot doesn't show that.
An even bigger problem might be the way the "ground truth" probability is calculated: they sample the model 30 times and take the fraction of correct results as the ground-truth probability. It's really fishy to call something "ground truth" when it is partly an internal property of the model's sampler and not an objective/external fact. I don't have more time to think about this, but something is off about it.
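For reference, the procedure as I understand it is roughly this (my paraphrase, with hypothetical sample/grade callables, not the authors' code):

```python
N_SAMPLES = 30

def ground_truth_prob(sample_fn, grade_fn, question, n=N_SAMPLES):
    """Estimate the 'ground truth' P(correct) for a question as the fraction of
    n sampled answers that a grader marks correct.

    sample_fn(question) -> one answer sampled from the model (at some temperature)
    grade_fn(answer)    -> True/False against the reference answer

    Note: the result depends on the model and its sampling temperature, which is
    exactly the part that bothers me -- it isn't a purely external fact about the question.
    """
    hits = sum(bool(grade_fn(sample_fn(question))) for _ in range(n))
    return hits / n

# The calibration plots then compare this number against the model's own
# self-evaluated P(True) for the same question.
```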
All this to say: reading long scientific papers is difficult and time-consuming, and let's be honest, you were not posting these links because you've spent hours poring over these papers and understood them; you posted them because the headlines support a world-view you like. As someone else noted, you can find good papers whose headlines conclude the opposite (like the work of rao2z).