For what it's worth, clang[0] compiles to the vpminu[bwd] + vpcmpeq[bwd] version even from manually written intrinsics of the addition-based version, and has since clang 7. Though, interestingly enough, it fails to cancel out a following xor.
(and, for completeness, a regular C loop[1], with some massaging to make it more readable)
That's unfortunate. The solution with min actually does equal work in this case (it just has a lower startup cost, hence my mention of a 'hostile environment'), but it has greater span, because in the traditional solution the high-bit toggling can be done for both inputs in parallel. If you switch to a loop, it ought to at the very least use the traditional solution, but it doesn't. It also doesn't use the trick with the saturating subtract. https://godbolt.org/z/76qKs49eW
> simply add 128 (or however much—depending on the sizes at hand) to each of your inputs before comparing
On many CPUs, bitwise XOR is slightly more efficient than addition. But you still need the magic number.
> and you are in a hostile environment, you will have to figure out how to load up a vector of 128s, which costs cycles
That particular vector can be generated with 2 instructions and no memory access: pcmpeqd to produce a vector with all bits set, then psllw/pslld to shift each lane so that only its sign bit remains.
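For illustration, here's roughly what that looks like with SSE2 intrinsics for 16-bit lanes (my own sketch, not from the article; the helper name is made up), combining the register-generated constant with the XOR-and-signed-compare approach:

    #include <immintrin.h>

    // Unsigned 16-bit a < b without loading any constant from memory:
    // build the 0x8000 mask in registers, flip the sign bits of both
    // inputs, then use the signed compare the ISA actually provides.
    static inline __m128i cmplt_epu16(__m128i a, __m128i b)
    {
        __m128i t    = _mm_undefined_si128();
        __m128i ones = _mm_cmpeq_epi32(t, t);     // pcmpeqd: all bits set
        __m128i sign = _mm_slli_epi16(ones, 15);  // psllw: 0x8000 in every lane
        __m128i as   = _mm_xor_si128(a, sign);    // toggle the high bit...
        __m128i bs   = _mm_xor_si128(b, sign);    // ...of both inputs in parallel
        return _mm_cmpgt_epi16(bs, as);           // signed bs > as  ==  unsigned a < b
    }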
Modern compilers support LTCG/LTO, which optimizes code across translation units.
If you compare these vectors in a loop, the magic vector is likely to be hoisted out of the loop and kept in a register.
> To avoid this, use min
Yeah, but if you need to compare for a < b as opposed to a <= b, you're gonna need a couple more instructions.
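To make the difference concrete, here's a sketch of mine (SSE4.1 for pminuw; helper names invented): a <= b is two instructions, while strict a < b picks up an extra inversion:

    #include <smmintrin.h>

    // a <= b, unsigned 16-bit lanes: min(a, b) == a.
    static inline __m128i cmple_epu16(__m128i a, __m128i b)
    {
        return _mm_cmpeq_epi16(_mm_min_epu16(a, b), a);
    }

    // a < b, unsigned 16-bit lanes: NOT (b <= a); the inversion is the
    // extra work, done here as andnot against an all-ones register.
    static inline __m128i cmplt_epu16_min(__m128i a, __m128i b)
    {
        __m128i b_le_a = _mm_cmpeq_epi16(_mm_min_epu16(b, a), b);
        __m128i ones   = _mm_cmpeq_epi32(a, a);
        return _mm_andnot_si128(b_le_a, ones);
    }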
> if anyone working on superoptimisation has caught these?
Yep, some constants are better generated than loaded; see the optimization docs from Agner Fog.
On a more global scale, all these assembly "optimisations"/tricks end up hidden deep inside compilers, which are reasonably "transparent" only to their own devs, due to their abysmal complexity and size.
We would need some sort of online library of those assembly (boolean/branchless calculus...) tricks.
A job for Wikipedia? Maybe linked to the maths/boolean calculus articles?
If you're interested in this, Hacker's Delight has several hundred pages of these tricks. The book picks up from a 1972 MIT memo called HAKMEM. For a database/web guy like me, reading it is just recreational, but that's probably true for most people here. I'm currently about halfway through one hundred pages on how to divide. :-)
I sometimes write optimized low-level code, but many of these books and internet articles are rather old. CPU architectures have converged to just two, AMD64 and ARM64. Both have tons of useful instructions like sign extension and integer division.
They also have less trivial instructions equivalent to some of these tricks, only faster. A couple of examples:
To turn off the rightmost 1-bit in a word, AMD64 has the BLSR instruction from the BMI1 set.
ARM CPUs have a fast instruction to reverse bits (RBIT). AMD64 does not, but it has a fast instruction to flip the order of bytes (BSWAP), which allows reversing bits faster than what's in that book.
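Both of those, sketched in C (my own code; GCC/Clang builtins and BMI1 intrinsics assumed):

    #include <stdint.h>
    #include <immintrin.h>  // _blsr_u64, needs -mbmi

    // Turn off the rightmost 1-bit: classic trick vs. dedicated instruction.
    static inline uint64_t clear_lowest_classic(uint64_t x) { return x & (x - 1); }
    static inline uint64_t clear_lowest_bmi1(uint64_t x)    { return _blsr_u64(x); }

    // Reverse the 64 bits: let bswap flip the byte order, then reverse
    // the bits inside each byte with three mask-and-shift steps.
    static inline uint64_t reverse_bits(uint64_t x)
    {
        x = __builtin_bswap64(x);  // single bswap instruction
        x = ((x & 0x0F0F0F0F0F0F0F0FULL) << 4) | ((x >> 4) & 0x0F0F0F0F0F0F0F0FULL);
        x = ((x & 0x3333333333333333ULL) << 2) | ((x >> 2) & 0x3333333333333333ULL);
        x = ((x & 0x5555555555555555ULL) << 1) | ((x >> 1) & 0x5555555555555555ULL);
        return x;
    }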
These tricks are only very rarely useful now. For instance, SIMD instruction sets can't divide vectors of integers; some older GPUs can't divide FP64 numbers (though most of them can multiply FP64, and all of them can divide FP32). For these exotic use cases, the tricks are still relevant.
If you’re not dividing greater than 2^23 numbers then can’t you just use the FP divide ? I’m sure I’ve done this in CUDA many years back.. Worked a treat :)
If you are then you have my sympathies..
That’s often a good way, but not necessarily the best one. On my computer with Zen3 CPU, cvtdq2ps and cvttps2dq have 3 cycles of latency each, divps up to 10 cycles, results in 16 CPU cycles in total.
When the divisor’s the same for all lanes and known at compile time, there’re tricks to reduce division to a single integer multiplication, and a few extra cheap instructions like shifts and additions. Usually faster than 16 cycles. Here’s an example, automagically made by clang 13: https://godbolt.org/z/T4TKb7oov
github.com/google/highway is arguably such a collection of tricks; the x86_256-inl.h header emulates some missing AVX-512 instructions, such as this one or compressstoreu.
(In this case we currently use a constant.)
> if you need to compare for a < b as opposed to a <= b, you're gonna need a couple more instructions
Depends on what you are doing--you may not need to spend any extra ops at all. For instance, if you are blending then you can compute a >= b and commute the blend arguments. If you are branching, then you pick the complementary flag to branch on. Etc.
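For example, a blend sketched with SSE4.1 intrinsics (my own code, 16-bit lanes): the mask computed is a >= b, and swapping the blend operands yields the a < b selection with no extra inversion:

    #include <smmintrin.h>

    // r = (a < b) ? x : y, unsigned 16-bit lanes, without inverting the mask.
    static inline __m128i select_lt_epu16(__m128i a, __m128i b, __m128i x, __m128i y)
    {
        __m128i a_ge_b = _mm_cmpeq_epi16(_mm_min_epu16(a, b), b);  // min(a,b)==b  <=>  a >= b
        // blendv takes the second operand where the mask is set, the first
        // elsewhere, so commuting x and y turns >= into the < selection.
        return _mm_blendv_epi8(x, y, a_ge_b);
    }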
That's pretty interesting. So AMD has split the traditional ALU across different execution units? I mean, usually the same ALU(s) that have an adder also have the logical ops.
Apparently 3 out of 4 units can add integers, but all 4 of them can do bitwise operations. Although the ISA defines multiple equivalent bitwise instructions like pand / andps / andpd, modern AMD CPUs apparently don't have a bypass delay, so these instructions are 100% equivalent on those CPUs. And with some luck (when not bottlenecked on instruction fetch, decode, or some other part of the pipeline), each Zen2 core can run 4 of them every cycle.
I would like to read more about superoptimizers targeting AVX2, since I've previously complained that gcc will not transform between various equivalent instruction sequences for moving bytes around within and between registers, and some of those sequences are a lot slower than others.
[0]: https://godbolt.org/z/e7TE5P73Y
[1]: https://godbolt.org/z/xhj3WTnxv