I very much enjoy the observation that LLMs appear to function optimally when trained on "tokens" rather than the raw, unfiltered stream of characters. I am ultimately trying to express an analogous belief: the individual audio samples here are as meaningless as individual letters are to an LLM.
Instead of "representation with the fewest assumptions," I would suggest that the optimal input for a model may be the representation in which the data is broken apart as far as it can be while still remaining meaningful. I have suggested in other replies that this might be achieved with quadrature samples, or perhaps with something like a granular decomposition -- something akin to a "token" of audio rather than of language.
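
To make that a bit more concrete, here is a minimal sketch (Python/NumPy, purely illustrative, with arbitrary grain-size and hop choices) of what I mean by a granular decomposition: the signal is sliced into short, windowed grains that each carry more standalone meaning than a single raw sample, loosely analogous to a text token versus a single character.

```python
import numpy as np

def granular_decompose(samples, grain_size=512, hop=256):
    """Split a 1-D audio signal into short overlapping 'grains'.

    Each grain is Hann-windowed so it is somewhat meaningful on its own,
    roughly the audio analogue of a token rather than a character.
    grain_size and hop here are illustrative choices, not prescriptions.
    """
    window = np.hanning(grain_size)
    grains = [
        samples[start:start + grain_size] * window
        for start in range(0, len(samples) - grain_size + 1, hop)
    ]
    return np.stack(grains)  # shape: (num_grains, grain_size)

# Toy usage: one second of a 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440 * t)
grains = granular_decompose(signal)
print(grains.shape)  # (61, 512) with the defaults above
```

Whether the right "vocabulary" comes from fixed grains like these, quadrature samples, or something learned end to end is exactly the open question, but the point is that the unit fed to the model sits somewhere above the individual sample.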