
I sometimes write optimized low-level code, but many of these books and internet articles are rather old. CPU architectures have converged on just two, AMD64 and ARM64. Both have tons of useful instructions, like sign extension and integer division.

They also have less trivial instructions equivalent to some of these tricks, only faster. A couple of examples:

To turn off the rightmost 1-bit in a word, AMD64 has the BLSR instruction from the BMI1 set.
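The portable form of that trick is the classic `x & (x - 1)`; modern compilers recognize the pattern and emit BLSR when BMI1 is enabled (e.g. `-mbmi`). A minimal sketch:

```c
#include <stdint.h>

/* Clear the lowest set bit of x.
   With GCC/Clang and -mbmi this compiles to a single BLSR instruction. */
static inline uint32_t clear_lowest_set_bit(uint32_t x) {
    return x & (x - 1);
}
```

Note it is well defined for zero too: `0 & 0xFFFFFFFF` is just 0.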

ARM CPUs have a fast instruction to reverse bits (RBIT). AMD64 does not, but it has a fast instruction to reverse the order of bytes (BSWAP), which lets you reverse bits faster than the pure shift-and-mask version in that book.
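A sketch of that hybrid approach: let the byte swap handle the coarse reversal (compilers turn the first step into a single BSWAP, or you can call `__builtin_bswap32` on GCC/Clang), then reverse the bits inside each byte with three mask-and-shift steps instead of five:

```c
#include <stdint.h>

/* Reverse all 32 bits of x. The first step is a byte swap
   (a single BSWAP instruction on x86); the remaining steps
   reverse the bits within each byte in parallel. */
static inline uint32_t reverse_bits32(uint32_t x) {
    x = (x >> 24) | ((x >> 8) & 0x0000FF00u)
      | ((x << 8) & 0x00FF0000u) | (x << 24);            /* byte swap */
    x = ((x >> 4) & 0x0F0F0F0Fu) | ((x & 0x0F0F0F0Fu) << 4); /* nibbles */
    x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2); /* pairs   */
    x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1); /* bits    */
    return x;
}
```

On ARM64 the whole function is of course just one RBIT.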

Nowadays these tricks are only rarely useful. For instance, SIMD instruction sets can't divide vectors of integers. Some older GPUs can't divide FP64 numbers, though most of them can multiply FP64, and all of them can divide FP32. For exotic use cases like these, the tricks are still relevant.



If you're not dividing numbers greater than 2^23, can't you just use the FP divide? I'm sure I've done this in CUDA many years back. Worked a treat :) If you are, then you have my sympathies.
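The idea above, sketched in scalar C (the hedge: FP32 has a 24-bit significand, so both operands must be exactly representable as floats; below 2^23 as the parent suggests is comfortably within that):

```c
#include <stdint.h>

/* Integer division via FP32 divide. Valid only when a and b are
   small enough to be exactly representable in a float's 24-bit
   significand; the truncating cast recovers the integer quotient. */
static inline uint32_t div_via_float(uint32_t a, uint32_t b) {
    return (uint32_t)((float)a / (float)b);
}
```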


That's often a good way, but not necessarily the best one. On my computer with a Zen 3 CPU, cvtdq2ps and cvttps2dq have 3 cycles of latency each, and divps up to 10 cycles, for about 16 CPU cycles in total.

When the divisor is the same for all lanes and known at compile time, there are tricks to reduce division to a single integer multiplication plus a few cheap extra instructions like shifts and additions. Usually faster than 16 cycles. Here's an example, automagically generated by clang 13: https://godbolt.org/z/T4TKb7oov
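In scalar form, the same transformation for unsigned division by 3 looks like this (0xAAAAAAAB is ceil(2^33 / 3), the "magic number" clang picks for `x / 3`; the vector version uses the analogous widening multiply):

```c
#include <stdint.h>

/* Unsigned division by the constant 3 rewritten as multiply-and-shift.
   x / 3 == (x * ceil(2^33 / 3)) >> 33 for all 32-bit x;
   the 64-bit product cannot overflow. */
static inline uint32_t div3(uint32_t x) {
    return (uint32_t)(((uint64_t)x * 0xAAAAAAABull) >> 33);
}
```

This is exact over the full uint32_t range, not an approximation, which is why compilers apply it unconditionally.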



