For what it's worth, clang[0] compiles to the vpminu[bwd] + vpcmpeq[bwd] version even from manually written intrinsics of the addition-based version, and has since clang 7. Though, interestingly enough, it fails to cancel out a following xor.
(and, for completeness, a regular C loop[1], with some massaging to make it more readable)
That's unfortunate. The solution with min actually does equal work in this case (it just has a lower startup cost, hence my mention of a 'hostile environment'), but it has greater span, because in the traditional solution the high-bit toggling can be done for both inputs in parallel. If you switch to a loop, it ought to at the very least use the traditional solution, but it doesn't. It also doesn't use the trick with the saturating subtract. https://godbolt.org/z/76qKs49eW
> simply add 128 (or however much—depending on the sizes at hand) to each of your inputs before comparing
On many CPUs, bitwise XOR is slightly more efficient than addition. But you still need the magic number.
> and you are in a hostile environment, you will have to figure out how to load up a vector of 128s, which costs cycles
That particular vector can be generated with 2 instructions and no memory access: pcmpeqd to produce a vector with all bits set, then psllw/pslld to shift each lane so that only its sign bit remains.
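For illustration, here's roughly what that looks like with SSE2 intrinsics for 16-bit lanes (my own sketch, not from the article; the helper name is made up), combining the register-generated constant with the XOR-and-signed-compare approach:

    #include <immintrin.h>

    // Unsigned 16-bit a < b without loading any constant from memory:
    // build the 0x8000 mask in registers, flip the sign bits of both
    // inputs, then use the signed compare the ISA actually provides.
    static inline __m128i cmplt_epu16(__m128i a, __m128i b)
    {
        __m128i t    = _mm_undefined_si128();
        __m128i ones = _mm_cmpeq_epi32(t, t);     // pcmpeqd: all bits set
        __m128i sign = _mm_slli_epi16(ones, 15);  // psllw: 0x8000 in every lane
        __m128i as   = _mm_xor_si128(a, sign);    // toggle the high bit...
        __m128i bs   = _mm_xor_si128(b, sign);    // ...of both inputs in parallel
        return _mm_cmpgt_epi16(bs, as);           // signed bs > as  ==  unsigned a < b
    }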
Modern compilers support LTCG/LTO, which optimizes code across translation units.
If you compare these vectors in a loop, the magic vector is likely to be hoisted out of the loop and kept in a register.
> To avoid this, use min
Yeah, but if you need to compare for a < b as opposed to a <= b, you're gonna need a couple more instructions.
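To make the difference concrete, here's a sketch of mine (SSE4.1 for pminuw; helper names invented): a <= b is two instructions, while strict a < b picks up an extra inversion:

    #include <smmintrin.h>

    // a <= b, unsigned 16-bit lanes: min(a, b) == a.
    static inline __m128i cmple_epu16(__m128i a, __m128i b)
    {
        return _mm_cmpeq_epi16(_mm_min_epu16(a, b), a);
    }

    // a < b, unsigned 16-bit lanes: NOT (b <= a); the inversion is the
    // extra work, done here as andnot against an all-ones register.
    static inline __m128i cmplt_epu16_min(__m128i a, __m128i b)
    {
        __m128i b_le_a = _mm_cmpeq_epi16(_mm_min_epu16(b, a), b);
        __m128i ones   = _mm_cmpeq_epi32(a, a);
        return _mm_andnot_si128(b_le_a, ones);
    }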
> if anyone working on superoptimisation has caught these?
Yep, some constants are better generated than loaded; see the optimization docs from Agner Fog.
On a more global scale, all these assembly "optimisations"/tricks end up hidden deep inside compilers, which are reasonably "transparent" only to their own devs, due to their abysmal complexity and size.
We would need some sort of online library of those assembly (boolean/branchless calculus...) tricks.
A job for Wikipedia? Maybe linked to the maths/boolean calculus articles?
If you're interested in this, Hacker's Delight has several hundred pages of these tricks. The book picks up from a 1972 MIT memo called HAKMEM. For a database/web guy like me, reading it is just recreational, but that's probably true for most people here. I'm currently about halfway through one hundred pages on how to divide. :-)
I sometimes write optimized low-level code, but many of these books and internet articles are rather old. CPU architectures have converged to just two, AMD64 and ARM64. Both have tons of useful instructions like sign extension and integer division.
They also have less trivial instructions equivalent to some of these tricks, only faster. A couple of examples:
To turn off the rightmost 1-bit in a word, AMD64 has the BLSR instruction from the BMI1 set.
ARM CPUs have a fast instruction to reverse bits (RBIT). AMD64 does not, but it has a fast instruction to flip the order of bytes (BSWAP), which allows reversing bits faster than what's in that book.
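Both of those, sketched in C (my own code; GCC/Clang builtins and BMI1 intrinsics assumed):

    #include <stdint.h>
    #include <immintrin.h>  // _blsr_u64, needs -mbmi

    // Turn off the rightmost 1-bit: classic trick vs. dedicated instruction.
    static inline uint64_t clear_lowest_classic(uint64_t x) { return x & (x - 1); }
    static inline uint64_t clear_lowest_bmi1(uint64_t x)    { return _blsr_u64(x); }

    // Reverse the 64 bits: let bswap flip the byte order, then reverse
    // the bits inside each byte with three mask-and-shift steps.
    static inline uint64_t reverse_bits(uint64_t x)
    {
        x = __builtin_bswap64(x);  // single bswap instruction
        x = ((x & 0x0F0F0F0F0F0F0F0FULL) << 4) | ((x >> 4) & 0x0F0F0F0F0F0F0F0FULL);
        x = ((x & 0x3333333333333333ULL) << 2) | ((x >> 2) & 0x3333333333333333ULL);
        x = ((x & 0x5555555555555555ULL) << 1) | ((x >> 1) & 0x5555555555555555ULL);
        return x;
    }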
These tricks are only very rarely useful now. For instance, SIMD instruction sets can't divide vectors of integers; some older GPUs can't divide FP64 numbers (though most of them can multiply FP64, and all of them can divide FP32). For these exotic use cases, the tricks are still relevant.
If you’re not dividing greater than 2^23 numbers then can’t you just use the FP divide ? I’m sure I’ve done this in CUDA many years back.. Worked a treat :)
If you are then you have my sympathies..
That’s often a good way, but not necessarily the best one. On my computer with Zen3 CPU, cvtdq2ps and cvttps2dq have 3 cycles of latency each, divps up to 10 cycles, results in 16 CPU cycles in total.
When the divisor’s the same for all lanes and known at compile time, there’re tricks to reduce division to a single integer multiplication, and a few extra cheap instructions like shifts and additions. Usually faster than 16 cycles. Here’s an example, automagically made by clang 13: https://godbolt.org/z/T4TKb7oov
github.com/google/highway is arguably such a collection of tricks; the x86_256-inl.h header emulates some missing AVX-512 instructions, such as this one or compressstoreu.
(In this case we currently use a constant.)
> if you need to compare for a < b as opposed to a <= b, you're gonna need a couple more instructions
Depends on what you are doing--you may not need to spend any extra ops at all. For instance, if you are blending then you can compute a >= b and commute the blend arguments. If you are branching, then you pick the complementary flag to branch on. Etc.
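For example, a blend sketched with SSE4.1 intrinsics (my own code, 16-bit lanes): the mask computed is a >= b, and swapping the blend operands yields the a < b selection with no extra inversion:

    #include <smmintrin.h>

    // r = (a < b) ? x : y, unsigned 16-bit lanes, without inverting the mask.
    static inline __m128i select_lt_epu16(__m128i a, __m128i b, __m128i x, __m128i y)
    {
        __m128i a_ge_b = _mm_cmpeq_epi16(_mm_min_epu16(a, b), b);  // min(a,b)==b  <=>  a >= b
        // blendv takes the second operand where the mask is set, the first
        // elsewhere, so commuting x and y turns >= into the < selection.
        return _mm_blendv_epi8(x, y, a_ge_b);
    }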
That's pretty interesting. So AMD has split the traditional ALU across different execution units? I mean, usually the same ALU(s) that have an adder also have the logical ops.
Apparently 3 out of 4 units can add integers, but all 4 of them can do bitwise operations. Although the ISA defines multiple equivalent bitwise instructions like pand / andps / andpd, modern AMD CPUs apparently don't have a bypass delay, so these instructions are 100% equivalent on those CPUs. And with some luck (when not bottlenecked on instruction fetch, decode, or some other part of the pipeline), each Zen2 core can run 4 of them every cycle.
I would like to read more about superoptimizers targeting AVX2, since I've previously complained that gcc will not transform between various equivalent instruction sequences for moving bytes around within and between registers, and some of those sequences are a lot slower than others.
[0]: https://godbolt.org/z/e7TE5P73Y
[1]: https://godbolt.org/z/xhj3WTnxv