I think the approximate reciprocal approach is interesting here. The doc mention...

I think the approximate reciprocal approach is interesting here. The doc mentions multiplying the dividend by ~1.00025 in the math to avoid FP error so you don’t end up off-by-one after truncation, but I think this hack is still incomplete! On some inputs (like 255, or other unlucky divisors near powers of two), you might get borderline rounding behaviour that flips a bit of the final integer. It’s easy to forget that single-precision floats don’t line up neatly with every 8bit integer ratio in real code, and a single off-by-one can break pixel ops or feed subtle bugs into a bigger pipeline.

I suspect a hybrid scheme like using approximate reciprocals for most values but punting to scalar for unlucky ones could handle these corner cases without killing performance. That’d be interesting to benchmark