Regardless of whether the convergence is superficial or not, I am interested especially in what this could mean for future compression of weights. Quantization of models is currently very dumb (per my limited understanding). Could exploitable patterns make it smarter?
reply