You actually issue the `futex` system call to get yourself on the wait queue tied to the memory address. It separates out the waiting from the locking.
And that can absolutely save a bunch of system calls, especially vs. polling mixed with `sleep()` or similar.
It does not; in fact the two are fundamentally inseparable, and the state of the memory address must be treated atomically with the waiting state. The magic of futex is that you can use a hardware atomic operation (e.g. `lock cmpxchg` on x86) to get the lock in the common/uncontended case. But if you have to wait, you need to tell the kernel both that you need to wait and the address on which you're waiting, so it can use the same hardware interlocks, along with its own state locking, to put you to sleep race-free.
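For concreteness, here's a minimal sketch of that wait path on Linux using the raw syscall (`futex_wait`/`futex_wake` are just local helper names, not a library API). The third argument to FUTEX_WAIT is the value you expect the word to still contain; the kernel re-checks it under its own locking and refuses to sleep if it has already changed:

```c
#include <linux/futex.h>   /* FUTEX_WAIT_PRIVATE, FUTEX_WAKE_PRIVATE */
#include <stdatomic.h>
#include <sys/syscall.h>   /* SYS_futex */
#include <unistd.h>        /* syscall() */

/* Sleep only if *addr still holds `expected` when the kernel checks it.
   If another thread changed the word in the window between our own check
   and this call, the kernel returns immediately (EAGAIN) instead of
   putting us to sleep on a wakeup that has already happened. */
static long futex_wait(atomic_int *addr, int expected)
{
    return syscall(SYS_futex, addr, FUTEX_WAIT_PRIVATE, expected,
                   NULL, NULL, 0);
}

/* Wake up to n threads currently sleeping on addr. */
static long futex_wake(atomic_int *addr, int n)
{
    return syscall(SYS_futex, addr, FUTEX_WAKE_PRIVATE, n, NULL, NULL, 0);
}
```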
It quite does separate them: the kernel is not the keeper of the lock; it only needs to detect the race condition that would otherwise have a thread go to sleep and miss its wakeup. It cares not one bit about the rest of your semantics.
It's true you could use it that way, but it's not the way it's meant to be used, defeating the purpose by requiring a system call even for uncontended locks.
I think you're misunderstanding how futexes work, or else making an essentially irrelevant semantic argument about the definition of "keeper". The kernel is, 100%, absolutely, the "keeper" of that lock data for the duration of the system call. It knows the (hardware) address and matches it against any other such syscall from any other process on the system. And that requires tracking and intelligence, and interaction with the VM system and the arch layer up and down the stack.
It just doesn't "allocate" the word itself; it lets the process use its own mapped memory. But to pretend that the kernel doesn't have to do any work, or that the memory is "separated", betrays some misunderstanding of what is actually happening here.
The kernel is responsible for maintaining the wait queues, and for making sure there is no race between checking the state and queueing a waiter.
It does not care how you use the queue, at all. It doesn't have to back a locking primitive whatsoever. You absolutely can use the exact same mechanism to implement, say, a thread pool with a set of dormant threads.
The state check in the basic futex is only done to avoid a race condition. None of the logic of preventing threads from entering critical sections is in the purview of the kernel, either. That's all application-level.
And most importantly, no real lock uses a futex for the locking parts. As mentioned in the article, typically a mutex will directly try to acquire the lock with an atomic operation, like an atomic fetch-and-or, fetch-and-add, or even compare-and-swap.
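To illustrate the split, here's a rough sketch of the classic three-state futex mutex in the style of Drepper's "Futexes Are Tricky" (illustrative only, not any particular library's code). The uncontended lock and unlock are a single atomic op each; the futex syscall only shows up on the contended paths:

```c
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

/* 0 = unlocked, 1 = locked, 2 = locked and somebody may be waiting */
typedef atomic_int mutex_t;

static void mutex_lock(mutex_t *m)
{
    int c = 0;
    /* Fast path: one CAS, no syscall, if the lock is free. */
    if (atomic_compare_exchange_strong(m, &c, 1))
        return;

    do {
        /* Advertise contention (state 2), then ask the kernel to sleep us,
           but only if the lock is still in state 2 when it checks. */
        int e = 1;
        if (c == 2 || atomic_compare_exchange_strong(m, &e, 2) || e == 2)
            syscall(SYS_futex, m, FUTEX_WAIT_PRIVATE, 2, NULL, NULL, 0);
        c = 0;
        /* Woken (or never slept): try to grab the lock, conservatively in
           the contended state since other waiters may still exist. */
    } while (!atomic_compare_exchange_strong(m, &c, 2));
}

static void mutex_unlock(mutex_t *m)
{
    /* Fast path: one atomic exchange. Only if the lock was marked as
       contended do we pay for a syscall to wake one waiter. */
    if (atomic_exchange(m, 0) == 2)
        syscall(SYS_futex, m, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
}
```

Note that the locking decision is entirely the atomics on the word; the kernel's only job is to park and wake threads without losing a wakeup.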
A single atomic op, even if you go for full sequential consistency (which comes w/ full pipeline stalls), is still a lot better than a trip into the kernel when you can avoid it.
Once again, I'm not saying you couldn't use the futex state check to decide what's locked and what's not. I'm saying nobody should, and it was never the intent.
The intent from the beginning was to separate out the locking from the waiting, and I think that's pretty clear in the original futex paper (linked to in my article).
I like to think of a futex as the simplest possible condition variable, where the predicate is just the state of the memory word (note that a mutex guarding the predicate is unnecessary since the word can be read and written atomically). It turns out that this is simple enough to implement efficiently in the kernel, yet expressive enough to implement pretty much any userspace synchronization primitive over it.
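That mental model translates almost directly into code. A minimal sketch, assuming Linux and a 32-bit counter word (the function names are just illustrative):

```c
#include <limits.h>        /* INT_MAX */
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Block until the word no longer holds the value we last observed.
   The "predicate" is just the word itself, read atomically, so no
   separate mutex is needed around it. */
static void wait_for_change(atomic_uint *word, unsigned observed)
{
    while (atomic_load(word) == observed) {
        /* The kernel sleeps us only if *word still equals `observed`;
           otherwise it returns at once and we re-check the predicate. */
        syscall(SYS_futex, word, FUTEX_WAIT_PRIVATE, observed,
                NULL, NULL, 0);
    }
}

/* Change the predicate and wake everyone waiting for a change. */
static void bump_and_wake_all(atomic_uint *word)
{
    atomic_fetch_add(word, 1);
    syscall(SYS_futex, word, FUTEX_WAKE_PRIVATE, INT_MAX, NULL, NULL, 0);
}
```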
You are of course completely right. In fact, sometimes I wish the kernel would do slightly more with the memory location, like optionally reserving a bit to indicate the empty/non-empty state of the wait queue: the kernel could keep it up to date cheaply as part of the wait/wake operations, whereas doing that from userspace is more complicated.
This sort of thing can be implemented in the kernel without special hardware operations: add the thread to the wake list, mark it as suspended, then check the value. If the value has changed, undo all of that and return; otherwise leave the thread suspended until it is woken. Special multi-word atomic instructions in hardware are not required, but the details are very dependent on how the kernel's thread switching is designed.
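Schematically, that ordering looks something like the following. To be clear, this is not actual Linux kernel code; every helper here is a hypothetical stand-in for the real scheduler and wait-queue machinery, and it assumes a wakeup arriving before the final reschedule makes it return immediately, the way prepare-to-wait style interfaces behave:

```c
/* Schematic only -- NOT actual Linux kernel code. All helpers below are
   hypothetical stand-ins for the real scheduler / wait-queue machinery. */
#include <errno.h>
#include <stdint.h>

void mark_current_sleeping(void);    /* declare intent to sleep                */
void mark_current_runnable(void);    /* undo that declaration                  */
void enqueue_waiter(uint32_t *addr); /* hash addr, queue the current thread    */
void dequeue_waiter(uint32_t *addr);
void reschedule(void);               /* give up the CPU; returns once runnable */

long futex_wait_sketch(uint32_t *addr, uint32_t expected)
{
    mark_current_sleeping();         /* 1. commit to sleeping...               */
    enqueue_waiter(addr);            /* 2. ...and become findable by wakers    */

    if (*addr != expected) {         /* 3. only now look at the value          */
        /* A wake/unlock raced with us: undo and report, rather than
           sleeping on a wakeup that has already gone by. */
        dequeue_waiter(addr);
        mark_current_runnable();
        return -EAGAIN;
    }

    /* 4. Actually switch away. A wakeup that arrived between steps 2 and
       here is assumed to make reschedule() return immediately, so it
       cannot be lost. */
    reschedule();
    return 0;
}
```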