It's true that I don't go into detail about double descent, though I do describe how increasing capacity often reduces overfitting.
I believe the figure labeled "Figure 1" illustrates what you are suggesting (despite being labeled Figure 1, it is actually at the bottom of the blog post, so maybe easy to miss).
> It's true that I don't go into detail about double descent, though I do describe how increasing capacity often reduces overfitting.
I agree.
> I believe the figure labeled "Figure 1" illustrates what you are suggesting (despite being labeled Figure 1, it is actually at the bottom of the blog post, so maybe easy to miss).
Easy to miss, yes. I'm not sure it illustrates the phenomenon, though. That plot shows extreme overfitting (i.e., interpolation) by the 10,000-parameter model. No one really understands what actually happens after interpolation. There is, in fact, anecdotal evidence that after crossing the interpolation threshold, large AI models trained with SGD gradually begin to ignore outliers and find simpler models (!) that generalize better (!). Counterintuitive, I know. This is an active area of research, with no good explanations yet, AFAIK.
The double descent phenomenon is what happens after interpolation.
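If it helps, here's a toy sketch of the effect (my construction, not the blog post's setup): minimum-norm least squares on random ReLU features, which is the classic setting where double descent shows up. Everything here -- the sin target, the noise level, the feature counts -- is an assumption I picked for illustration, and how sharp the peak at the interpolation threshold looks depends on the seed and the noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train = 40                                  # interpolation threshold is p = 40
x_train = rng.uniform(-1, 1, n_train)
x_test = rng.uniform(-1, 1, 500)
f = lambda x: np.sin(2 * np.pi * x)           # hypothetical ground truth
y_train = f(x_train) + 0.3 * rng.standard_normal(n_train)
y_test = f(x_test)

for p in [5, 10, 20, 40, 80, 200, 1000]:
    # Random ReLU features, shared between train and test.
    w, b = rng.standard_normal(p), rng.standard_normal(p)
    Phi_train = np.maximum(0.0, np.outer(x_train, w) + b)
    Phi_test = np.maximum(0.0, np.outer(x_test, w) + b)
    # lstsq returns the *minimum-norm* solution once p >= n_train, i.e. it
    # picks the "simplest" of the infinitely many interpolating fits --
    # that implicit bias is what drives the second descent.
    beta, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)
    test_mse = np.mean((Phi_test @ beta - y_test) ** 2)
    print(f"p = {p:4d}   test MSE = {test_mse:.3f}")
```

Test error typically rises as p approaches n_train, spikes near the interpolation threshold, and then descends again as p grows well past it.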
--
RESPONDING TO YOUR LAST COMMENT (after reaching the thread depth limit):
Think of it this way: why and how does a model's performance on previously unseen samples keep improving after the model has fully overfit (interpolated between) all training samples? Interpolation is not the endpoint of training but a temporary threshold, after which models somehow learn to generalize better. How do these models improve on interpolation?
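For what it's worth, here's a toy illustration of the implicit-bias story I'm gesturing at, under strong simplifying assumptions (linear model, noiseless labels, Gaussian features -- all my choices, nothing from the post): past the interpolation threshold there are infinitely many solutions that fit the training data perfectly, and the minimum-norm one generalizes much better than an arbitrary one.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 300                                  # p >> n: overparameterized
X_train = rng.standard_normal((n, p))
beta_true = rng.standard_normal(p) / np.sqrt(p)
y_train = X_train @ beta_true                   # noiseless, for simplicity
X_test = rng.standard_normal((1000, p))
y_test = X_test @ beta_true

# The minimum-norm interpolator (what the pseudoinverse -- and, in this
# linear setting, gradient descent initialized at zero -- converges to).
beta_min = np.linalg.pinv(X_train) @ y_train

# A different interpolator: add a component from the null space of X_train.
# It fits the training data just as perfectly, but generalizes far worse.
v = rng.standard_normal(p)
v -= np.linalg.pinv(X_train) @ (X_train @ v)    # project out the row space
beta_other = beta_min + 0.5 * v

for name, beta in [("min-norm", beta_min), ("other interpolator", beta_other)]:
    train_mse = np.mean((X_train @ beta - y_train) ** 2)
    test_mse = np.mean((X_test @ beta - y_test) ** 2)
    print(f"{name:18s}  train MSE = {train_mse:.1e}   test MSE = {test_mse:.3f}")
```

Both fits have essentially zero training error, but only the minimum-norm one has low test error -- which is the sense in which a model can "improve on interpolation" if training nudges it toward simpler interpolating solutions.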
I can't reply directly -- is there a maximum thread depth, or a maximum conversation depth?
Anyway -- I wanted to apologize for misreading -- I missed the parenthetical "interpolation" in your comment. I think we are both interpreting the plot the same way.
In terms of your comment about anecdotal evidence -- are you talking about the case where data and model size are increased jointly? If so, I agree, though I don't think that's cleanly a matter of double descent/overparameterization anymore.