andikleen2's comments (Hacker News)

Early x86-64 Linux had a similar problem. The x86-64 ABI passes the first 6 arguments in registers. Supporting a variable number of arguments (as in printf) requires passing a count in an extra register (RAX), so that the callee can spill the argument registers to memory for va_arg() and friends. Doing this for every call would be too expensive, so it's only done when the prototype is declared stdarg.

Now the initial gcc implemented this spilling with a kind of Duff's device: a computed jump into a block of register-saving instructions, so that only the needed registers were saved. There was no bounds check, so if the argument-count register (RAX) was not initialized correctly, it would jump randomly based on the junk value and cause very confusing bug reports.

This bit quite a lot of software that didn't use correct prototypes, calling stdarg functions without declaring them as such. On 32-bit x86, which didn't pass arguments in registers, this wasn't a problem.

Later compiler versions switched to saving all registers unconditionally.


In the SysV ABI for AMD64, the AL register is used to pass an upper bound on the number of vector registers used. Is this related to what you're talking about?


> 4. First design that used a wind tunnel to get an efficient wing shape

The Wrights based their wing on Lilienthal's, who used a variant of a wind tunnel (as well as actual gliders) for his optimizations.

You could also argue that he ran a coherent research program too; it was just sadly cut short by his fatal flying accident.

https://www.lilienthal-museum.de/olma/ewright.htm


Lilienthal's lift/drag numbers turned out to be off by a factor of 2, which is why the Wrights developed their wind tunnel and did exacting experiments to get the correct numbers, and developed the shape of their wing from it.


If you read Anderson's "A History of Aerodynamics", it disagrees on this point. It states that the Wrights didn't have a good way to calculate drag, and that they didn't understand many of the side effects of real wings (like flow separation), which caused wrong measurements initially. Later on they apparently came back to something that was closer to Lilienthal's numbers, even though the problem simply wasn't fully understood at the time.

This paper has a similar conclusion: https://corescholar.libraries.wright.edu/cgi/viewcontent.cgi...

"When the Wright brothers compared their results with those of Lilienthal, they found some disagreement, but not as much as they expected. As Wilbur states in his diary for October 16, 1901: "It would appear that Lilienthal is very much nearer the truth then we have heretofore been disposed to think." [Wolko, 21]. The formulas were still not producing the lift and drag that were actually being produced. The only other possible source of error in these equations was the Smeaton coefficient of air pressure."


Thanks for the information. The bottom line was the Wrights needed a considerably bigger wing than that predicted by Lilienthal's numbers.


This technique works with any instruction that clobbers a register, not just CPUID. In the worst case you could just single-step the other CPUs until you hit an instruction that overwrites a register. Such instructions are common.

In my case (a custom hypervisor for a sadly cancelled project) it wasn't a problem, because the hypervisor exited during early guest boot, which is single-CPU only.


Having spent a lot of time porting 32-bit system code to 64-bit, I developed a dislike for these explicit types. It's a slippery slope to hard-coding your bitness, with people making assumptions about where size_t or pointers fit.

Now maybe if you're already 64-bit that's fine (it's unlikely that we'll ever need 128-bit, and code is unlikely to grow down), but for anything starting smaller it's a pain.


Huh?

Zig, like Rust, has two kinds of "primitive" numeric types, the kind which are an explicit size in bits (u8, i16, f64 and so on) and then the word size ones (isize, usize), which are whatever size is suitable for your target machine.

C gets this all muddled because it has named primitive numeric types but their meaning is imprecise, and then it uses a typedef to assign one of these (but you don't know which one) as the native word size. So maybe long is the same as your size_t, and C will just assume you know what you're doing when you write a long where you need a size_t - thus making your code non-portable.


Just because your code compiles on a new platform doesn't mean the prior assumptions about the behavior of the types remain valid.


But it should, and that's what stricter typing can help with.


People are already working on a 128bit Linux kernel.


The sizes of floating point are growing "apart" in the AI space. Who's to say what might happen with sizes of types. Maybe it's great to use mostly u8 if you're controlling a worldwide asynchronous networked megacomputer.


They’re better than implicit assumptions that int is 32 bits.


Back when Usenet was still functional, de.sci.history (German history group) had some dedicated proponents of a German conspiracy theory that the early Middle Ages (the Carolingian dynasty in central Europe) didn't exist. Basically their argument was that there are gaps in the respective documents, and also a lot of forgeries, and that the best explanation was that a few hundred years were missing. This of course ignored all the carbon dating, as well as the parallel chronologies of other cultures (including early Islam, which falls exactly into the "missing" period).

https://en.wikipedia.org/wiki/Phantom_time_hypothesis

At the beginning I found it amusing, but after some time it just got tiring.


Ah I [vaguely] remember something about a large gap in German history where no one got a diploma. The argument was that forging lists of names and attributing credentials to them without works or publications was very complicated.

If I remember correctly, it was the same period for which Morozov demonstrated the astronomical records to be way off.

Funny stuff.


Even if they don't livelock, retries can be significantly slower than a direct atomic operation under contention. Compare https://www.cs.tau.ac.il/~mad/publications/ppopp2013-x86queu...

Programming under high cache line contention is like message passing on a really busy network with many nodes, and anything that cuts down round trips can make a big difference in scalability. Most people who do network programming know these lessons by heart, but it's still poorly understood by people doing shared memory.

So maybe it's simpler, but likely it's slower too.


I'm not sure what it is you're arguing is "maybe simpler, but likely slower too". LL/SC?

The paper you're referring to isn't about LL/SC at all. It's about CAS vs fetch-and-add (called AMOADD on RISC-V), showing that F&A scales better than C&S.

LL/SC is referred to only in that some ISAs (including ARM and RISC-V) use LL/SC to implement C&S. But then it goes on to look at x86 exclusively.

Not only the instructions matter here, but also the implementation.

For example, the way LL/SC can livelock is that the LL needs to acquire the cache line for writing, flushing it from other CPUs' caches. If another CPU does the same to you before you get to the SC, then you lose out and have to loop and try again.

RISC-V provides a "forward progress guarantee" if there are 16 or fewer instructions between the LL and SC, they are "simple" instructions, and they are all in the same cache line.

One simple way to implement this is to delay responding to a cache eviction request for up to 16 clock cycles if you have a LL on that cache line. Then the only way the SC fails (for properly written code) is if you take an interrupt.

The design and implementation of the Atomic Memory Operations is also interesting. RISC-V was co-designed with the TileLink bus. TileLink comes in three flavours:

1) TL-UL Uncached Lightweight: simple get/put transactions

2) TL-UH Uncached Heavyweight: adds support for atomic operations

3) TL-C Cached: adds support for coherent cache blocks

With TL-C, atomic memory operations don't have to be executed by fetching the data all the way to the CPU, doing the operation, and writing it back. With, for example, an AMOADD (atomic add), both the address and the data to be added are sent over the TL-C bus, and the add is executed in a shared L2 or L3 cache, or even potentially (it's up to the system designer) directly in the L1 cache of another CPU. Only the result of the add is sent back. The latency is potentially the same as a simple memory read.

TL-C is obviously used inside an SoC, but can also be used over a link such as FMC (FPGA Mezzanine Card), PCIe, or 400G Ethernet (e.g. Western Digital's "OmniXtend").

TL-UH would typically be implemented by peripherals so they can implement AMOs directly on their registers.

TileLink:

https://sifive.cdn.prismic.io/sifive%2Fcab05224-2df1-4af8-ad...

Other buses implement similar features, but this is the one I'm more familiar with.


>essentially no cost unless the lock is contested.

Atomics are not free. And fetching a cache line can be very expensive, even without contention.

However, lockless algorithms are often worse, because they have even more atomics and even more cache line transfers. And they're really hard to get right. (IIRC there was a statistic somewhere about the average number of corrections in lockless papers; it wasn't pretty.)

Back to the non-contended locks. Usually the way to get non-contended locks is to push the locks down to smaller and smaller pieces of data. But imagine what happens when you do get some contention (e.g. due to a freak hash collision). Instead of paying the cache transfer cost once, you'll pay it many times, and that can really ruin your day (the speed of light is very limiting at nanosecond scales).

I wrote more about this here [1]

I am somewhat biased, but I would recommend that every time you consider something lockless, you consider using lock elision/HTM instead (see [2] and [3]). It often has most of the advantages with few of the downsides (though admittedly also some quirks of its own).

[1] http://halobates.de/applicative-mental-models.pdf [2] http://queue.acm.org/detail.cfm?id=2579227 [3] http://www.intel.com/software/tsx


> I am somewhat biased but I would recommend that every time you consider something lockless, consider using lock elision/HTM instead

Intel shipped two generations of processors with broken TSX before getting it right, and then timing attacks seriously called into question whether it was a safe feature to use in practice. Then Intel dropped it from their first post-Skylake microarchitecture products, though it looks like they're still planning to support it on their next generation of server CPUs. Is it really worth the trouble at this point?


Yes it totally is.

How much confidence do you have in your lockless code?


It's not about confidence in the code. It's about confidence in having hardware that supports TSX, and will also support TSX next year. As far as I can tell, you still need to ensure that you have correct and reasonably efficient fallback code for platforms that don't or no longer have TSX, unless you're working on a project that can get away with very narrow system requirements.

Put simply, TSX is not mainstream, and isn't on track to be mainstream anytime soon.


>Put simply, TSX is not mainstream, and isn't on track to be mainstream anytime soon.

Being in the vast majority of servers deployed in the last decade (Intel+POWER) is not mainstream?


Intel's only been shipping server CPUs with non-defective transactional memory for about five years, and a good chunk of those servers have it disabled for security reasons, or did at some point pending better fixes. IBM has dropped transactional memory from their latest POWER processors. Even if we only look at server CPUs, this does not look like a thriving ecosystem.


In the early tens we ran an engineering project to improve critical sections in applications using transactional memory. We had a PhD-level intern applying our techniques to various open source projects. One target was MongoDB. After a few days of investigating the Mongo source code, he had to give up because he couldn't even find the critical sections in the source. They had locking, but it was extremely convoluted.

So yes I would agree with that. Never use MongoDB.


Seems trivial to exploit the kernel module:

  struct network_activity *activity = NLMSG_DATA(nlh);   /* untrusted data from the netlink socket */
  append_rule(activity->process_path, (activity->allowed == 1));
  ...

And in append_rule:

  /* Don't do anything if the process_path length is > PATH_LENGTH */
  if (strlen(process_path) > PATH_LENGTH)
      return;

But nobody enforces that process_path has a terminating 0 byte, so it can likely be abused for all kinds of attacks on the kernel. Better not run it anywhere you care about security.

I found this after about 2 minutes of code reading, so likely no code audit was ever done.


Small mistake in the article. System V code was never released publicly by SCO, just some earlier variants not too far removed from V7.

They make great code reading if anyone is interested. The early Unix variants were quite simple and concise, nothing like the complexity of a modern variant.

https://github.com/dspinellis/unix-history-repo/tree/Researc...

For example early process swapping was really simple:

https://github.com/dspinellis/unix-history-repo/blob/Researc...

or scheduling was quite simple too:

https://github.com/dspinellis/unix-history-repo/blob/Researc...

or that's most of a file system:

https://github.com/dspinellis/unix-history-repo/blob/Researc...

scanner and lexer of the earlier (pre-pcc) C compiler:

https://github.com/dspinellis/unix-history-repo/blob/Researc...

and that is ls:

https://github.com/dspinellis/unix-history-repo/blob/Researc...

ed, the standard editor:

https://github.com/dspinellis/unix-history-repo/blob/Researc...

