
I sometimes write optimized low-level code, but many of these books and internet articles are rather old. CPU architectures have converged on just two, AMD64 and ARM64. Both have tons of useful instructions, like sign extension and integer division.

They also have less trivial instructions equivalent to some of these tricks, only faster. A couple of examples:

To turn off the rightmost 1-bit in a word, AMD64 has the BLSR instruction from the BMI1 set.
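The portable form of that trick is the classic `x & (x - 1)`; modern compilers recognize the pattern and emit BLSR when BMI1 is enabled (e.g. `-mbmi`). A minimal sketch:

```c
#include <stdint.h>

/* Clear the lowest set bit of x.
   With GCC/Clang and -mbmi this compiles to a single BLSR instruction. */
static inline uint32_t clear_lowest_set_bit(uint32_t x) {
    return x & (x - 1);
}
```

Note it is well defined for zero too: `0 & 0xFFFFFFFF` is just 0.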

ARM CPUs have a fast instruction to reverse bits (RBIT). AMD64 does not, but it has a fast instruction to reverse the order of bytes (BSWAP), which lets you reverse bits faster than the pure shift-and-mask version in that book.
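A sketch of that hybrid approach: let the byte swap handle the coarse reversal (compilers turn the first step into a single BSWAP, or you can call `__builtin_bswap32` on GCC/Clang), then reverse the bits inside each byte with three mask-and-shift steps instead of five:

```c
#include <stdint.h>

/* Reverse all 32 bits of x. The first step is a byte swap
   (a single BSWAP instruction on x86); the remaining steps
   reverse the bits within each byte in parallel. */
static inline uint32_t reverse_bits32(uint32_t x) {
    x = (x >> 24) | ((x >> 8) & 0x0000FF00u)
      | ((x << 8) & 0x00FF0000u) | (x << 24);            /* byte swap */
    x = ((x >> 4) & 0x0F0F0F0Fu) | ((x & 0x0F0F0F0Fu) << 4); /* nibbles */
    x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2); /* pairs   */
    x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1); /* bits    */
    return x;
}
```

On ARM64 the whole function is of course just one RBIT.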

Nowadays these tricks are only rarely useful. For instance, SIMD instruction sets can't divide vectors of integers. Some older GPUs can't divide FP64 numbers, though most of them can multiply FP64, and all of them can divide FP32. For exotic use cases like these, the tricks are still relevant.



If you're not dividing numbers greater than 2^23, can't you just use the FP divide? I'm sure I've done this in CUDA many years back. Worked a treat :) If you are, then you have my sympathies.
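The idea above, sketched in scalar C (the hedge: FP32 has a 24-bit significand, so both operands must be exactly representable as floats; below 2^23 as the parent suggests is comfortably within that):

```c
#include <stdint.h>

/* Integer division via FP32 divide. Valid only when a and b are
   small enough to be exactly representable in a float's 24-bit
   significand; the truncating cast recovers the integer quotient. */
static inline uint32_t div_via_float(uint32_t a, uint32_t b) {
    return (uint32_t)((float)a / (float)b);
}
```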


That's often a good way, but not necessarily the best one. On my computer with a Zen 3 CPU, cvtdq2ps and cvttps2dq have 3 cycles of latency each, and divps up to 10 cycles, for about 16 CPU cycles in total.

When the divisor is the same for all lanes and known at compile time, there are tricks to reduce division to a single integer multiplication plus a few cheap extra instructions like shifts and additions. Usually faster than 16 cycles. Here's an example, automagically generated by clang 13: https://godbolt.org/z/T4TKb7oov
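In scalar form, the same transformation for unsigned division by 3 looks like this (0xAAAAAAAB is ceil(2^33 / 3), the "magic number" clang picks for `x / 3`; the vector version uses the analogous widening multiply):

```c
#include <stdint.h>

/* Unsigned division by the constant 3 rewritten as multiply-and-shift.
   x / 3 == (x * ceil(2^33 / 3)) >> 33 for all 32-bit x;
   the 64-bit product cannot overflow. */
static inline uint32_t div3(uint32_t x) {
    return (uint32_t)(((uint64_t)x * 0xAAAAAAABull) >> 33);
}
```

This is exact over the full uint32_t range, not an approximation, which is why compilers apply it unconditionally.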



