Figure 13 right panel also shows there isn't a y=x relationship on out-of-sample tests.
First, we agree by observation that outside of the top-right and bottom-left corners there isn't any meaningful relationship in the data, regardless of the numerical value of the correlation. Second, in those corners it is not clear to me what the relationship is but it looks flattish (i.e. if the ground truth is ~0 then the model-guess-for-truth could be anywhere from 0 to 0.5). This is also consistent with the general behavior displayed in Figure 13.
If you have some other interpretation of the data you should lay it out. The authors certainly did not do that.
edit:
By the way there are people working on a re-sampling algorithm based on the entropy and variance of the output logits called entropix: if the output probabilities for the next token are spread fairly evenly, for example, rather than putting overwhelming probability on a single token, they prompt for additional clarification. They don't really claim anything like the model "knows" whether it's wrong, but they say it improves performance.
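Roughly, the gating idea looks something like this (my own minimal sketch, not entropix's actual code; the thresholds and the three-way split are illustrative):

```python
import torch
import torch.nn.functional as F

def entropy_and_varentropy(logits: torch.Tensor):
    """Entropy and variance of surprisal for a next-token distribution (1D logits over the vocab)."""
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(-1)                     # H = -sum p log p
    varentropy = (probs * (log_probs + entropy) ** 2).sum(-1)  # Var(-log p)
    return entropy, varentropy

def choose_strategy(logits, ent_thresh=3.0, var_thresh=5.0):
    """Gate what to do next on how 'unsure' the distribution looks.
    Thresholds are illustrative, not tuned to anything."""
    ent, var = entropy_and_varentropy(logits)
    if ent < ent_thresh and var < var_thresh:
        return "argmax"             # confident and stable: just take the top token
    if ent > ent_thresh and var > var_thresh:
        return "ask_clarification"  # flat and unstable: inject a clarifying prompt
    return "resample"               # in between: sample again, e.g. at higher temperature
```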
>Figure 13 right panel also shows there isn't a y=x relationship on out-of-sample tests.
A y=x relationship is not necessary for meaningful correlation and the abstract is quite clear on out of sample performance either way.
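Toy illustration of that point (made-up numbers): a predictor can be nowhere near y=x and still correlate strongly with the target, and still carry real information about it.

```python
import numpy as np

rng = np.random.default_rng(0)
truth = rng.uniform(0, 1, 1000)                        # hypothetical per-question accuracy
pred = 0.3 * truth + 0.2 + rng.normal(0, 0.02, 1000)   # compressed and offset "predictions"

r = np.corrcoef(truth, pred)[0, 1]
print(round(r, 3))  # ~0.97: strong correlation, far from y=x
```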
>Second, in those corners it is not clear to me what the relationship is but it looks flattish (i.e. if the ground truth is ~0 then the model-guess-for-truth could be anywhere from 0 to 0.5).
The upper bound for guess-for-truth is not as important as the frequency. Yes it could guess 0.5 for 0 but how often compared to reasonable numbers? A test set on TriviaQA could well be thousands of questions.
>edit: By the way there are people working on a re-sampling algorithm based on the entropy and variance of the output logits called entropix
I know about entropix. It hinges strongly on the model's representations. If it works, then choosing to call it "knowing" or not is just semantics.
> A y=x relationship is not necessary for meaningful correlation
I’m not concerned with correlation (which may or may not indicate an actual relationship) per se; I’m concerned with whether there is a meaningful relationship between predicted and actual. The Figure 12 plot clearly shows that predicted isn’t tracking actual even in the corners. I think one of the lines in Figure 13 right (predicting 0% while the actual is something like 40%, going from memory on my phone) shows even more clearly that there isn’t a meaningful relationship. In any case the authors haven’t made any argument about how those plots support their claims, and I don’t think you can either.
> the abstract is quite clear on out of sample performance either way.
Yes I’m saying the abstract is not supported by the results. You might as well say the title is very clear.
> The upper bound for guess-for-truth is not as important as the frequency. Yes it could guess 0.5 for 0 but how often compared to reasonable numbers? A test set on TriviaQA could well be thousands of questions.
Now we’ve gone from “the paper shows” to speculating about what the paper might have shown (and even that is probably not possible, based on the Figure 13 line I described above).
> choosing to call it "knowing" or not is just semantics.
Yes, it’s semantics, but that implies it’s meaningless to use the term instead of talking about the actual underlying properties.
For the red Lambada line in Fig 13 when the model predicts ~0 the ground truth is 0.7. No one can look at that line and say there is a meaningful relationship. The Py Func Synthesis line also doesn't look good above 0.3-0.4.
> The abstract also quite literally states that models struggle with out of distribution tests so again, what is the contradiction here?
Out of distribution is the only test that matters. If it doesn't work out of distribution it doesn't work. Surely you know that.
> Would it have been hard to simply say you found the results unconvincing?
Anyone can look at the graphs, especially Figure 13, and see this isn't a matter of opinion.
> There is nothing contradictory in the paper.
The results contradict the titular claim that "Language Models (Mostly) Know What They Know".
>For the red Lambada line in Fig 13 when the model predicts ~0 the ground truth is 0.7. No one can look at that line and say there is a meaningful relationship. The Py Func Synthesis line also doesn't look good above 0.3-0.4.
Yeah but Lambada is not the only line there.
>Out of distribution is the only test that matters. If it doesn't work out of distribution it doesn't work. Surely you know that.
Train the classifier on math questions and you get good calibration for math; train it on true/false questions and you get good calibration for true/false; train it on math and it struggles with true/false (and vice versa). This is what "out-of-distribution" refers to here.
Make no mistake, the fact that the first two both work is evidence that models encode some knowledge about the truthfulness of their responses. If they didn't, it wouldn't work at all. Statistics is not magic and gradient descent won't bring order where there is none.
What the out-of-distribution "failure" here indicates is that "truth" is multifaceted and situation-dependent, and that interpreting the model's features is very difficult. You can't train a "general LLM lie detector", but that doesn't mean model features are unable to provide insight into whether a response is true or not.
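To spell out mechanically what "train on math, test on true/false" means here, a minimal sketch with synthetic stand-in features (my own illustration, not the paper's probe code; load_features is a fake placeholder for extracting hidden-state activations and correctness labels):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss

def load_features(task, n=4000, d=32, seed=0):
    """Fake placeholder for (hidden-state features, was-the-answer-correct labels).
    Each task encodes 'truthfulness' along a different direction, so a probe
    trained on one task only partially transfers to the other."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    w = np.zeros(d)
    if task == "math":
        w[0] = 1.0
    else:  # "true_false"
        w[0], w[1] = 0.3, 1.0
    y = (X @ w + 0.3 * rng.normal(size=n) > 0).astype(int)
    return X, y

X_math, y_math = load_features("math", seed=0)
X_tf, y_tf = load_features("true_false", seed=1)

probe = LogisticRegression(max_iter=1000).fit(X_math, y_math)

# In-distribution: predicted probabilities track correctness reasonably well (low Brier score).
print("in-dist Brier:", brier_score_loss(y_math, probe.predict_proba(X_math)[:, 1]))

# Out-of-distribution: same probe, different task family -- calibration degrades (higher Brier).
print("OOD Brier:", brier_score_loss(y_tf, probe.predict_proba(X_tf)[:, 1]))
```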
> Well good thing Lambada is not the only line there.
There are 3 out-of-distribution lines, all of them bad. I explicitly described two of them. Moreover, it seems like the worst time for your uncertainty indicator to silently fail is when you are out of distribution.
But okay, forget about out-of-distribution and go back to Figure 12, which is in-distribution. What relationship are you supposed to take away from the left panel? From what I understand, they were trying to train a y=x relationship, but as I said previously, the plot doesn't show that.
An even bigger problem might be the way the "ground truth" probability is calculated: they sample the model 30 times and take the fraction of correct results as the ground-truth probability. It's really fishy to call something "ground truth" when it is partly an internal property of the model's sampler and not an objective/external fact. I don't have more time to think about this, but something is off about it.
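For reference, the procedure as I understand it is roughly this (my paraphrase, with hypothetical sample/grade callables, not the authors' code):

```python
N_SAMPLES = 30

def ground_truth_prob(sample_fn, grade_fn, question, n=N_SAMPLES):
    """Estimate the 'ground truth' P(correct) for a question as the fraction of
    n sampled answers that a grader marks correct.

    sample_fn(question) -> one answer sampled from the model (at some temperature)
    grade_fn(answer)    -> True/False against the reference answer

    Note: the result depends on the model and its sampling temperature, which is
    exactly the part that bothers me -- it isn't a purely external fact about the question.
    """
    hits = sum(bool(grade_fn(sample_fn(question))) for _ in range(n))
    return hits / n

# The calibration plots then compare this number against the model's own
# self-evaluated P(True) for the same question.
```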
All this to say: reading long scientific papers is difficult and time-consuming, and let's be honest, you were not posting these links because you've spent hours poring over these papers and understood them; you posted them because the headlines support a world-view you like. As someone else noted, you can find good papers whose headlines conclude the opposite (like the work of rao2z).