Zstd's decoder is very very fast for what it does, what ideas do you have to imp...

anonymoushn · on Nov 4, 2022

I expect any "perform a conversion of some sort on a byte stream" implementation that uses 0 SIMD instructions and is not memory bound is leaving a lot of performance on the table, especially if one is permitted to mess with the design and layout of the input to make it more amenable to vectorized consumption. I cannot confidently claim that we're missing a >2x speedup in this case, though; it may be as low as ~1.3x or something.

mappu · on Nov 4, 2022

I checked the zstd source and I'm surprised to see you're right:

There's a little x86_64 assembly, but I don't see a single SIMD instruction anywhere, and no intrinsics either. Brotli seems to be the same. I assume zstd still gains something from the compiler's SIMD autovectorization; it might be interesting to benchmark with and without such a flag.

Since the zstd bytestream got frozen in RFC 8478, messing with the layout too much would require a zstd2 and moving the whole world over to it again (Linux kernel compression, the rpm binary format, etc.).